DataFrame.values on selected column - python

I get the following error when I try to get not all the values but only a specified column. I think the error comes from the column I specify after .values.
Any help would be appreciated.
supp_bal dataframe:
circulating_supply total_supply
currency
0xBTC 4758600 20999984
1337 26456031141 29258384256
1SG 2187147 22227000
1ST 85558370 93468691
1WO 20981450 37219452
1X2 0 3051868
2GIVE 521605983 521605983
42 41 41
611 478519 478519
777 0 10000000000
A 26842657 278273649
AAA 15090818 397000000
pos_bal dataframe:
2019-07-23 2019-07-24
app_vendor_id currency
3 1WO 2604 2304
ABX 44 44
ADH 822 82
ALX 25 200
AMLT 3673 367
BCH -41 -26
my code:
f = pos_bal.index.get_level_values('currency')
supp_bal['circulating_supply'].loc[f].values['circulating_supply']
error:
pos_bal['circulating_supply'] = supp_bal['circulating_supply'].loc[f].values['circulating_supply']
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

You don't need to use the column name after .values; this should work:
import pandas as pd
import numpy as np
supp_bal = pd.read_csv('D:\\supp_bal.csv', header=0)
pos_bal = pd.read_csv('D:\\pos_bal.csv', header=0)
supp_bal = supp_bal.set_index('currency')
pos_bal = pos_bal.set_index(['app_vendor_id', 'currency'])
display(supp_bal)
display(pos_bal)
f = pos_bal.index.get_level_values('currency')
pos_bal['circulating_supply']= supp_bal['circulating_supply'].loc[f].values
display(pos_bal)
The output:
circulating_supply total_supply
currency
0xBTC 4758600 20999984
1337 26456031141 29258384256
1SG 2187147 22227000
1ST 85558370 93468691
1WO 20981450 37219452
1X2 0 3051868
2GIVE 521605983 521605983
42 41 41
611 478519 478519
777 0 10000000000
A 26842657 278273649
AAA 15090818 397000000
7/23/2019 7/24/2019
app_vendor_id currency
3 1WO 2604 2304
ABX 44 44
ADH 822 82
ALX 25 200
AMLT 3673 367
Final pos_bal
7/23/2019 7/24/2019 circulating_supply
app_vendor_id currency
3 1WO 2604 2304 20981450.0
ABX 44 44 NaN
ADH 822 82 NaN
ALX 25 200 NaN
AMLT 3673 367 NaN
Note, in the data you provided, only 1WO appears in both DataFrames, that's why the other rows are all NaN.
btw, I have pandas 0.24.2.
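A possible alternative (a minimal sketch, assuming the same supp_bal and pos_bal as above) is to map the currency level of pos_bal's index directly onto supp_bal['circulating_supply']; this skips .values entirely and leaves currencies missing from supp_bal as NaN:
f = pos_bal.index.get_level_values('currency')
# Index.map accepts a Series, so each currency is looked up in supp_bal's index
pos_bal['circulating_supply'] = f.map(supp_bal['circulating_supply'])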

Do you mean this:
f = pos_bal.index.get_level_values('currency')
supp_bal['circulating_supply'].loc[f]

Related

Fisher's Exact Test from Pandas Dataframe

I'm trying to work out the best way to create a p-value using Fisher's Exact test from four columns in a dataframe. I have already extracted the four parts of a contingency table, with 'a' being top-left, 'b' being top-right, 'c' being bottom-left and 'd' being bottom-right. I have started including additional calculated columns via simple pandas calculations, but these aren't necessary if there's an easier way to just use the 4 initial columns. I have over 1 million rows when including an additional set (x.type = high), so I want to use an efficient method. So far this is my code:
import pandas as pd
import glob
import math
path = r'directory_path'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
frame['a+b'] = frame['a'] + frame['b']
frame['c+d'] = frame['c'] + frame['d']
frame['a+c'] = frame['a'] + frame['c']
frame['b+d'] = frame['b'] + frame['d']
As an example of this data, 'frame' currently shows:
ID(n) a b c d i x.name x.type a+b c+d a+c b+d
0 1258065 5 28 31 1690 1754 Albumin low 33 1721 36 1718
1 1132105 4 19 32 1699 1754 Albumin low 23 1731 36 1718
2 898621 4 30 32 1688 1754 Albumin low 34 1720 36 1718
3 573158 4 30 32 1688 1754 Albumin low 34 1720 36 1718
4 572975 4 23 32 1695 1754 Albumin low 27 1727 36 1718
... ... ... ... ... ... ... ... ... ... ... ... ...
666646 12435 1 0 27 1726 1754 WHR low 1 1753 28 1726
666647 15119 1 0 27 1726 1754 WHR low 1 1753 28 1726
666648 17053 1 2 27 1724 1754 WHR low 3 1751 28 1726
666649 24765 1 3 27 1723 1754 WHR low 4 1750 28 1726
666650 8733 1 1 27 1725 1754 WHR low 2 1752 28 1726
Is the best approach to convert these to a NumPy array and process them through iteration, or to keep them in pandas? I assume that I can't use math functions within a DataFrame (I've tried math.comb(), which didn't work in a DataFrame). I've also tried using pyranges for its fisher method, but it doesn't seem to work with my environment (Python 3.8).
Any help would be much appreciated!
Following the answer here, which came from the author of pyranges (I think), let's say your data is something like:
import pandas as pd
import scipy.stats as stats
import numpy as np
np.random.seed(111)
df = pd.DataFrame(np.random.randint(1,100,(1000000,4)))
df.columns=['a','b','c','d']
df['ID'] = range(1000000)
df.head()
a b c d ID
0 85 85 85 87 0
1 20 42 67 83 1
2 41 72 58 8 2
3 13 11 66 89 3
4 29 15 35 22 4
Convert it into a NumPy array and do it as in the post:
c = df[['a','b','c','d']].to_numpy(dtype='uint64')
from fisher import pvalue_npy
_, _, twosided = pvalue_npy(c[:, 0], c[:, 1], c[:, 2], c[:, 3])
df['odds'] = (c[:, 0] * c[:, 3]) / (c[:, 1] * c[:, 2])
df['pvalue'] = twosided
Or you can feed the columns in directly:
_, _, twosided = pvalue_npy(df['a'].to_numpy(np.uint), df['b'].to_numpy(np.uint),
                            df['c'].to_numpy(np.uint), df['d'].to_numpy(np.uint))
df['odds'] = (df['a'] * df['d']) / (df['b'] * df['c'])
df['pvalue'] = twosided
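For comparison, a per-row baseline using scipy.stats.fisher_exact (a small sketch assuming the same df as above) should give essentially the same two-sided p-values, but it loops in Python and is far too slow for a million rows; it is mainly useful for spot-checking the vectorized result on a small sample:
from scipy import stats

sample = df.head(1000).copy()  # spot-check a small slice only
sample['pvalue_check'] = sample.apply(
    lambda r: stats.fisher_exact([[r['a'], r['b']], [r['c'], r['d']]])[1],
    axis=1)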

Preserving NaN values when using groupby and lambda function on dataframe

Following on from this question, I have a dataset as such:
ChildID MotherID preDiabetes
0 20 455 No
1 20 455 Not documented
2 13 102 NaN
3 13 102 Yes
4 702 946 No
5 82 571 No
6 82 571 Yes
7 82 571 Not documented
8 60 530 NaN
Which I have transformed to the following such that each mother has a single value for preDiabetes:
ChildID MotherID preDiabetes
0 20 455 No
1 13 102 Yes
2 702 946 No
3 82 571 Yes
4 60 530 No
I did this by applying the following logic:
if preDiabetes=="Yes" for a particular MotherID, assign preDiabetes a value of "Yes" regardless of the remaining observations
else if preDiabetes != "Yes" for a particular MotherID, I will assign preDiabetes a value of "No"
However, after thinking about this again, I realised that I should preserve NaN values to impute them later on, rather than just assign them "No".
So I should edit my logic to be:
if preDiabetes=="Yes" for a particular MotherID, assign preDiabetes a value of "Yes" regardless of the remaining observations
else if all values for preDiabetes==NaN for a particular MotherID, assign preDiabetes a single NaN value
else assign preDiabetes a value of "No"
So, in the above table MotherID=530 should have a value of NaN for preDiabetes like so:
ChildID MotherID preDiabetes
0 20 455 No
1 13 102 Yes
2 702 946 No
3 82 571 Yes
4 60 530 NaN
I tried doing this using the following line of code:
df = df.groupby(['MotherID', 'ChildID'])['preDiabetes'].apply(
    lambda x: 'Yes' if 'Yes' in x.values else (np.NaN if np.NaN in x.values.all() else 'No'))
However, running this line of code is resulting in the following error:
TypeError: 'in ' requires string as left operand, not float
I'd appreciate if you guys can point out what it is I am doing wrong. Thank you.
You can try:
import pandas as pd
import numpy as np
import io
data_string = """ChildID,MotherID,preDiabetes
20,455,No
20,455,Not documented
13,102,NaN
13,102,Yes
702,946,No
82,571,No
82,571,Yes
82,571,Not documented
60,530,NaN
"""
data = io.StringIO(data_string)
df = pd.read_csv(data, sep=',', na_values=['NaN'])
df.fillna('no_value', inplace=True)
df = df.groupby(['MotherID', 'ChildID'])['preDiabetes'].apply(
    lambda x: 'Yes' if 'Yes' in x.values else (np.NaN if 'no_value' in x.values.all() else 'No'))
df
Result:
MotherID ChildID
102 13 Yes
455 20 No
530 60 NaN
571 82 Yes
946 702 No
Name: preDiabetes, dtype: object
You can do it using a custom function:
def func(s):
    if s.eq('Yes').any():
        return 'Yes'
    elif s.isna().all():
        return np.nan
    else:
        return 'No'

df = (df
      .groupby(['ChildID', 'MotherID'])
      .agg({'preDiabetes': func}))
print(df)
ChildID MotherID preDiabetes
0 13 102 Yes
1 20 455 No
2 60 530 NaN
3 82 571 Yes
4 702 946 No
Try:
df['preDiabetes']=df['preDiabetes'].map({'Yes': 1, 'No': 0}).fillna(-1)
df=df.groupby(['MotherID', 'ChildID'])['preDiabetes'].max().map({1: 'Yes', 0: 'No', -1: 'NaN'}).reset_index()
The first line converts preDiabetes to numbers, treating everything other than Yes or No as NaN (denoted by -1).
The second line: if at least one preDiabetes in the group is Yes, we output Yes; if the group has both No and NaN, we output No; if all are NaN, we output NaN.
Outputs:
>>> df
MotherID ChildID preDiabetes
0 102 13 Yes
1 455 20 No
2 530 60 NaN
3 571 82 Yes
4 946 702 No
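Note that the map above turns -1 into the literal string 'NaN' rather than a real missing value; if you want an actual NaN for later imputation, a small variation (a sketch assuming the same df and imports as above) is to map -1 to np.nan instead:
df['preDiabetes'] = df['preDiabetes'].map({'Yes': 1, 'No': 0}).fillna(-1)
df = (df.groupby(['MotherID', 'ChildID'])['preDiabetes']
        .max()
        .map({1: 'Yes', 0: 'No', -1: np.nan})
        .reset_index())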

Python/pandas: Find matching values from two dataframes and return third value

I have two different dataframes (df1, df2) with completely different shapes: df1: (64, 6); df2: (564, 9).
df1 contains a column (df1.objectdesc) which has values (strings) that can also be found in a column in df2 (df2.objdescription). As the two dataframes have different shapes I have to work with .isin() to get the matching values. I then would like to get a third value from a different column in df2 (df2.idname) from exactly those rows which match and add them to df1 - this is where I struggle.
example datasets:
df1
Content objectdesc TS_id
0 sdrgs 1_OG.Raum45 55
1 sdfg 2_OG.Raum23 34
2 psdfg GG.Raum12 78
3 sdfg 1_OG.Raum98 67
df2:
Numb_val object_count objdescription min idname
0 463 9876 1_OG_Raum76 1 wq19
1 251 8324 2_OG.Raum34 9 zt45
2 456 1257 1_OG.Raum45 4 bh34
3 356 1357 2_OG.Raum23 3 if32
4 246 3452 GG.Raum12 5 lu76
5 345 8553 1_OG.Raum98 8 pr61
expected output:
Content objectdesc TS_id idname
0 sdrgs 1_OG.Raum45 55 bh34
1 sdfg 2_OG.Raum23 34 if32
2 psdfg GG.Raum12 78 lu76
3 sdfg 1_OG.Raum98 67 pr61
This is my code so far:
def get_id(x, y):
    for values in x, y:
        if x['objectdesc'].isin(y['objdescription']).any() == True:
            return y['idname']

df1['idname'] = get_id(df1, df2)
This unfortunately only provides the values of df2['idname'] starting from index 0, instead of really giving me the values from the rows which match.
Any help is appreciated. Thank you!
Maybe try this:
df1.merge(df2, left_on='objectdesc', right_on='objdescription')[['Content', 'objectdesc', 'TS_id', 'idname']]
reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html
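If some rows of df1 have no match in df2 and you want to keep them (with NaN for idname), a left merge is a small variation (a sketch assuming the same df1 and df2 as in the question):
out = df1.merge(df2[['objdescription', 'idname']], how='left',
                left_on='objectdesc', right_on='objdescription').drop(columns='objdescription')
# unmatched objectdesc values end up with NaN in the idname column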
You can merge the two.
from io import StringIO
import pandas as pd
df_1_str = \
'''
Content objectdesc TS_id
sdrgs 1_OG.Raum45 55
sdfg 2_OG.Raum23 34
psdfg GG.Raum12 78
sdfg 1_OG.Raum98 67
'''
df_2_str = \
'''
Numb_val object_count objdescription min idname
463 9876 1_OG_Raum76 1 wq19
251 8324 2_OG.Raum34 9 zt45
456 1257 1_OG.Raum45 4 bh34
356 1357 2_OG.Raum23 3 if32
246 3452 GG.Raum12 5 lu76
345 8553 1_OG.Raum98 8 pr61
'''
df_1 = pd.read_csv(StringIO(df_1_str), header=0, delim_whitespace=True)
df_2 = pd.read_csv(StringIO(df_2_str), header=0, delim_whitespace=True)
df_3 = df_1.merge(df_2[['objdescription', 'idname']], left_on='objectdesc',
                  right_on='objdescription').drop('objdescription', axis='columns')
Contents of df_3:
Content objectdesc TS_id idname
-- --------- ------------ ------- --------
0 sdrgs 1_OG.Raum45 55 bh34
1 sdfg 2_OG.Raum23 34 if32
2 psdfg GG.Raum12 78 lu76
3 sdfg 1_OG.Raum98 67 pr61

Iterating over pandas rows to get minimum

Here is my dataframe:
Date cell tumor_size(mm)
25/10/2015 113 51
22/10/2015 222 50
22/10/2015 883 45
20/10/2015 334 35
19/10/2015 564 47
19/10/2015 123 56
22/10/2014 345 36
13/12/2013 456 44
What I want to do is compare the sizes of the tumors detected on different days. Let's consider cell 222 as an example; I want to compare its size to other cells that were detected on earlier days, e.g. I will not compare its size with cell 883, because they were detected on the same day, nor with cell 113, because it was detected later on.
As my dataset is too large, I have to iterate over the rows. If I explain it in a non-pythonic way:
for the cell 222:
get_size_distance(absolute value):
(|50 - 35| = 15), (|50 - 47| = 3), (|50 - 56| = 6), (|50 - 36| = 14), (|50 - 44| = 6)
get_minimum = 3; I got this value when I compared it with 564, so I will name it as the pair for cell 222
Then do it for the cell 883
The resulting output should look like this:
Date cell tumor_size(mm) pair size_difference
25/10/2015 113 51 222 1
22/10/2015 222 50 123 6
22/10/2015 883 45 456 1
20/10/2015 334 35 345 1
19/10/2015 564 47 456 3
19/10/2015 123 56 456 12
22/10/2014 345 36 456 8
13/12/2013 456 44 NaN NaN
I will really appreciate your help
It's not pretty, but I believe it does the trick
from datetime import datetime
import pandas as pd

a = pd.read_clipboard()
# Cut off last row since it was a faulty date. You can skip this.
df = a.copy().iloc[:-1]
# Convert to dates and order just in case (not really needed I guess).
df['Date'] = df.Date.apply(lambda x: datetime.strptime(x, '%d/%m/%Y'))
df = df.sort_values('Date', ascending=False)
# Rename column
df = df.rename(columns={"tumor_size(mm)": 'tumor_size'})
# These will be our lists of pairs and size differences.
pairs = []
diffs = []
# Loop over all unique dates
for date in df.Date.unique():
    # Only take dates earlier than the current date.
    compare_df = df.loc[df.Date < date].copy()
    # Loop over each cell for this date and find the minimum
    for row in df.loc[df.Date == date].itertuples():
        # If no earlier cells are available, use NaNs.
        if compare_df.empty:
            pairs.append(float('nan'))
            diffs.append(float('nan'))
        # Take the lowest absolute value and fill in otherwise
        else:
            compare_df['size_diff'] = abs(compare_df.tumor_size - row.tumor_size)
            row_of_interest = compare_df.loc[compare_df.size_diff == compare_df.size_diff.min()]
            pairs.append(row_of_interest.cell.values[0])
            diffs.append(row_of_interest.size_diff.values[0])
df['pair'] = pairs
df['size_difference'] = diffs
returns:
Date cell tumor_size pair size_difference
0 2015-10-25 113 51 222.0 1.0
1 2015-10-22 222 50 564.0 3.0
2 2015-10-22 883 45 564.0 2.0
3 2015-10-20 334 35 345.0 1.0
4 2015-10-19 564 47 345.0 11.0
5 2015-10-19 123 56 345.0 20.0
6 2014-10-22 345 36 NaN NaN
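If the dataset is large, the double loop can be replaced with a pairwise NumPy comparison (a rough sketch, assuming df already has the parsed Date column and the renamed tumor_size column as above; it builds an n-by-n matrix, so it suits thousands rather than millions of rows):
import numpy as np

sizes = df['tumor_size'].to_numpy(dtype=float)
dates = df['Date'].to_numpy()
cells = df['cell'].to_numpy()

# Absolute size difference between every pair of rows.
diff = np.abs(sizes[:, None] - sizes[None, :])
# Disallow comparisons against the same day or later days.
diff[dates[None, :] >= dates[:, None]] = np.inf

best = diff.argmin(axis=1)
has_earlier = np.isfinite(diff.min(axis=1))
df['pair'] = np.where(has_earlier, cells[best], np.nan)
df['size_difference'] = np.where(has_earlier, diff[np.arange(len(df)), best], np.nan)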

Sorting and arranging a list using pandas

I have an input file, shown below, which needs to be arranged so that the key values are in ascending order, while keys which are not present are printed last.
I am getting the data arranged in the required format, but the ordering is missing.
I have tried using the sort() method, but it shows "list has no attribute sort".
Please suggest a solution, and also suggest any modifications required.
Input file:
3=1388|4=1388|5=IBM|8=157.75|9=88929|1021=1500|854=n|388=157.75|394=157.75|474=157.75|1584=88929|444=20160713|459=93000546718000|461=7|55=93000552181000|22=89020|400=157.75|361=0.73|981=0|16=1468416600.6006|18=1468416600.6006|362=0.46
3=1388|4=1388|5=IBM|8=157.73|9=100|1021=0|854=p|394=157.73|474=157.749977558|1584=89029|444=20160713|459=93001362639104|461=26142|55=93001362849000|22=89120|361=0.71|981=0|16=1468416601.372|18=1468416601.372|362=0.45
3=1388|4=1388|5=IBM|8=157.69|9=100|1021=600|854=p|394=157.69|474=157.749910415|1584=89129|444=20160713|459=93004178882560|461=27052|55=93004179085000|22=89328|361=0.67|981=1|16=1468416604.1916|18=1468416604.1916|362=0.43
Code I tried:
import pandas as pd
import numpy as np
df = pd.read_csv('inputfile', index_col=None, names=['text'])
s = df.text.str.split('|')
ds = [dict(w.split('=', 1) for w in x) for x in s]
p = pd.DataFrame.from_records(ds)
p1 = p.replace(np.nan,'n/a', regex=True)
st = p1.stack(level=0,dropna=False)
dfs = [g for i,g in st.groupby(level=0)]
#print st
i = 0
while i < len(dfs):
    # index of each column
    print ('\nindex[%d]'%i)
    for (_, k), v in dfs[i].iteritems():
        print k, '\t', v
    i = i + 1
Output I am getting:
index[0]
1021 1500
1584 88929
16 1468416600.6006
18 1468416600.6006
22 89020
3 1388
361 0.73
362 0.46
388 157.75
394 157.75
4 1388
400 157.75
444 20160713
459 93000546718000
461 7
474 157.75
5 IBM
55 93000552181000
8 157.75
854 n
9 88929
981 0
index[1]
1021 0
1584 89029
16 1468416601.372
18 1468416601.372
22 89120
3 1388
361 0.71
362 0.45
388 n/a
394 157.73
4 1388
400 n/a
444 20160713
459 93001362639104
461 26142
474 157.749977558
5 IBM
55 93001362849000
8 157.73
854 p
9 100
981 0
Expected output:
index[0]
3 1388
4 1388
5 IBM
8 157.75
9 88929
16 1468416600.6006
18 1468416600.6006
22 89020
55 93000552181000
361 0.73
362 0.46
388 157.75
394 157.75
400 157.75
444 20160713
459 93000546718000
461 7
474 157.75
854 n
981 0
1021 1500
1584 88929
index[1]
3 1388
4 1388
5 IBM
8 157.75
9 88929
16 1468416600.6006
18 1468416600.6006
22 89020
55 93000552181000
361 0.73
362 0.46
394 157.75
444 20160713
459 93000546718000
461 7
474 157.75
854 n
981 0
1021 1500
1584 88929
388 n/a
400 n/a
Replace your ds line with
ds = [{int(pair[0]): pair[1] for pair in [w.split('=', 1) for w in x]} for x in s]
This converts the keys to integers so they will be sorted numerically.
To output the n/a values at the end, you could use the pandas selection to output the nonnull values first, then the null values, e.g:
for (ix, series) in p.iterrows():
    print('\nindex[%d]' % ix)
    output_series(ix, series[pd.notnull])
    output_series(ix, series[pd.isnull].fillna('n/a'))
btw, you can also simplify your stack, groupby, print to:
for (ix, series) in p1.iterrows():
    print('\nindex[%d]' % ix)
    for tag, value in series.iteritems():
        print(tag, '\t', value)
So the whole script becomes:
import pandas as pd

def output_series(ix, series):
    for tag, value in series.iteritems():
        print(tag, '\t', value)

df = pd.read_csv('inputfile', index_col=None, names=['text'])
s = df.text.str.split('|')
ds = [{int(pair[0]): pair[1] for pair in [w.split('=', 1) for w in x]} for x in s]
p = pd.DataFrame.from_records(ds)
for (ix, series) in p.iterrows():
    print('\nindex[%d]' % ix)
    output_series(ix, series[pd.notnull])
    output_series(ix, series[pd.isnull].fillna('n/a'))
Here:
import pandas as pd
import numpy as np
df = pd.read_csv('inputfile', index_col=None, names=['text'])
s = df.text.str.split('|')
ds = [dict(w.split('=', 1) for w in x) for x in s]
p1 = pd.DataFrame.from_records(ds).fillna('n/a')
st = p1.stack(level=0,dropna=False)
for k, v in st.groupby(level=0):
    print(k, v.sort_index())
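Note that in this second version the tags are still strings, so sort_index() orders them lexicographically ('1021' before '16' before '3'); if you want numeric ordering here as well, one adjustment (a sketch combining this with the integer-key idea from the previous answer) is:
ds = [{int(k): val for k, val in (w.split('=', 1) for w in x)} for x in s]
p1 = pd.DataFrame.from_records(ds).fillna('n/a')
st = p1.stack(level=0, dropna=False)
for k, v in st.groupby(level=0):
    print(k, v.sort_index())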
