Splitting a DataFrame in 2 based on a column's value - python

I have a dataset that I am trying to split into 2 smaller dataframes called test and train. The original dataset has two columns, "patient_nbr" and "encounter_id", both of which hold 6-digit values.
How can I go through this dataframe and add up all the digits in those two columns? For example, if the first row holds the values 123456 and 123456, I need to add 1+2+3+4+5+6+1+2+3+4+5+6. The sum determines whether the row goes into test or train: if it is even, train; if it is odd, test.
Below is what I tried, but it is so slow. I turned the two columns I need into two numpy arrays in order to break down and add up the digits, added those numpy arrays to get one, and looped through that to determine which dataframe each row should go in.
with ZipFile('dataset_diabetes.zip') as zf:
    with zf.open('dataset_diabetes/diabetic_data.csv', 'r') as f:
        df = pd.read_csv(f)

nums1 = []
nums2 = []
encounters = df["encounter_id"].values
for i in range(len(encounters)):
    result = 0
    while encounters[i] > 0:
        rem = encounters[i] % 10
        result = result + rem
        encounters[i] = int(encounters[i] / 10)
    nums1.append(result)

patients = df["patient_nbr"].values
for i in range(len(patients)):
    result = 0
    while patients[i] > 0:
        rem = patients[i] % 10
        result = result + rem
        patients[i] = int(patients[i] / 10)
    nums2.append(result)

nums = np.asarray(nums1) + np.asarray(nums2)
df["num"] = nums
# nums = df["num"].values
train = pd.DataFrame()
test = pd.DataFrame()
for i in range(len(nums)):
    if int(nums[i] % 2) == 0:
        # even goes to train
        train = train.append(df.iloc[i])
    else:
        # odd goes to test
        test = test.append(df.iloc[i])

You can do it by playing with astype to go from int to str: sum both columns over the row as strings (which concatenates them), then str.split with expand=True to get one digit per column, select the digit columns, cast each digit to float, and sum again per row.
# dummy example
df = pd.DataFrame({'patient_nbr': [123456, 123457, 123458],
                   'encounter_id': [123456, 123456, 123457]})

# create num
df['num'] = df[['patient_nbr', 'encounter_id']].astype(str).sum(axis=1)\
              .astype(str).str.split('', expand=True)\
              .loc[:, 1:12].astype(float).sum(axis=1)
print (df)
   patient_nbr  encounter_id   num
0       123456        123456  42.0
1       123457        123456  43.0
2       123458        123457  45.0
Then use this column to create a mask, with even as False and odd as True:
mask = (df['num'] % 2).astype(bool)
train = df.loc[~mask, :]  # train is the even rows
test = df.loc[mask, :]    # test is the odd rows
print (test)
   patient_nbr  encounter_id   num
1       123457        123456  43.0
2       123458        123457  45.0
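For speed, the digit sums can also be computed fully vectorized with integer arithmetic, avoiding both the per-row Python loops and the string splitting. This is a sketch of my own, not from the answer above, keeping the thread's convention that even sums go to train:

import numpy as np

def digit_sums(values):
    # repeatedly strip the last digit of every element until all are exhausted
    arr = np.array(values, dtype=np.int64)
    totals = np.zeros_like(arr)
    while (arr > 0).any():
        totals += arr % 10
        arr //= 10
    return totals

nums = digit_sums(df['patient_nbr']) + digit_sums(df['encounter_id'])
train = df[nums % 2 == 0]
test = df[nums % 2 == 1]

Boolean masks select whole blocks of rows at once, which is far faster than appending rows one at a time.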

Related

Pandas Convert object values used in Passive Components to float

I have a dataframe of part numbers stored as object, each containing 3 characters that encode a value in one of two formats:
Either 1R2, where the R is the decimal separator.
Or only digits, where the first 2 are significant and the 3rd is the number of trailing zeros:
101 = 100
010 = 1
223 = 22000
476 = 47000000
My dataframe (the important characters are positions 5-7):
             MATNR
0  xx01B101KO3XYZC
1  xx03C010CA3GN5T
2  xx02L1R2CA3ANNR
The code below works fine for the 1R2 case and converts object to float64, but I am stuck on combining the 2 significant digits with the number of zeros.
value_pos1 = 5
value_pos2 = 6
value_pos3 = 7
df['Value'] = pd.to_numeric(np.where(df['MATNR'].str.get(value_pos2) == 'R',
                                     df['MATNR'].str.get(value_pos1) + '.' + df['MATNR'].str.get(value_pos3),
                                     df['MATNR'].str.slice(start=value_pos1, stop=value_pos3) + df['MATNR'].str.get(value_pos3)))
Result:
MATNR     object
Value    float64
dtype: object
Index(['MATNR', 'Value'], dtype='object')
             MATNR  Value
0  xx01B101KO3XYZC  101.0
1  xx03C010CA3GN5T   10.0
2  xx02L1R2CA3ANNR    1.2
It should be:
             MATNR  Value
0  xx01B101KO3XYZC  100.0
1  xx03C010CA3GN5T    1.0
2  xx02L1R2CA3ANNR    1.2
I tried the following, but it raises errors, and on top of that it produces a wrong value when the digit at pos3 is 0, because str(pow(10, 0)) is '1' rather than an empty run of zeros.
df['Value'] = pd.to_numeric(np.where(df['MATNR'].str.get(value_pos2) == 'R',
                                     df['MATNR'].str.get(Value_pos1) + '.' + df['MATNR'].str.get(value_pos3),
                                     df['MATNR'].str.slice(start=value_pos1, stop=value_pos3) + str(pow(10, pd.to_numeric(df['MATNR'].str.get(value_pos3))))))
Do you have an idea?
If I have understood your problem correctly, defining a method and applying it to all the values of the column seems most intuitive. The method takes a str input and returns a float number.
Here is a snippet of what the simple method will entail.
def get_number(strrep):
    if not strrep or len(strrep) < 8:
        return 0.0
    useful_str = strrep[5:8]
    if useful_str[1] == 'R':
        return float(useful_str[0] + '.' + useful_str[2])
    else:
        zeros = '0' * int(useful_str[2])
        return float(useful_str[0:2] + zeros)
Then you could simply create a new column with the numeric conversion of the strings. The easiest way possible is using list comprehension:
df['Value'] = [get_number(x) for x in df['MATNR']]
Not sure where the bug in your code is, but another option that I tend to use when creating a new column based on other columns is pandas' apply function:
def create_value_col(row):
    if row['MATNR'][value_pos2] == 'R':
        # cast to float so the column does not end up with mixed str/int values
        val = float(row['MATNR'][value_pos1] + '.' + row['MATNR'][value_pos3])
    else:
        val = (int(row['MATNR'][value_pos1]) * 10 +
               int(row['MATNR'][value_pos2])) * 10 ** int(row['MATNR'][value_pos3])
    return val

df['Value'] = df.apply(lambda row: create_value_col(row), axis='columns')
This way, you can create a function that processes the data however you want and then apply it to every row and add the resulting series to your dataframe.
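For completeness, the asker's np.where approach can also be made to work fully vectorized. This is only a sketch under the question's assumptions (value_pos1 through value_pos3 as defined above); errors='coerce' keeps each branch from failing on rows meant for the other branch:

import numpy as np
import pandas as pd

s = df['MATNR']
is_r = s.str.get(value_pos2) == 'R'
# 1R2 -> 1.2
r_vals = pd.to_numeric(s.str.get(value_pos1) + '.' + s.str.get(value_pos3),
                       errors='coerce')
# 101 -> 10 * 10**1 = 100: multiply by the power of ten
# instead of concatenating str(pow(10, ...))
n_vals = (pd.to_numeric(s.str.slice(value_pos1, value_pos2 + 1), errors='coerce')
          * 10.0 ** pd.to_numeric(s.str.get(value_pos3), errors='coerce'))
df['Value'] = np.where(is_r, r_vals, n_vals)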

Filter by the number of digits pandas

I have a Dataframe that has only one column with numbers ranging from 1 to 10000000000.
df1 =
165437890
2321434256
324334567
4326457
243567869
234567843
......
7654356785432
7654324567543
I want a resulting Dataframe that only has numbers with 9 digits, where the digits are all different from each other. Is this possible? I don't have a clue how to start.
OBS:
1) I need to filter out the numbers that have repeated digits.
For example, 122234543 would go out of my DataFrame since it has the digit 2 repeated 3 times and the digits 4 and 3 each repeated 2 times.
def is_good(num):
    numstr = str(num)
    if len(numstr) == 9 and len(set(numstr)) == 9:
        return True
    return False

df1 = df1[df1[0].apply(is_good)]  # apply to the single column (assumed to be labeled 0)
flt = (df.Numbers >= 100000000) & (df.Numbers < 1000000000)
df = pd.DataFrame(df[flt]['Numbers'].unique())
Where Numbers is the column name with your numbers.
Solution for digits that are different from each other within the number itself (the negative lookahead (?!.*(.).*\1) rejects any string in which some character occurs twice):
df.Numbers = df.Numbers.astype('str')
df = df[df.Numbers.str.match(r'^(?!.*(.).*\1)[0-9]{9}$')]
Or another solution based on Igor's answer:
def has_unique_9digits(n):
    s = str(n)
    return len(s) == len(set(s)) == 9

df = df[df.Numbers.apply(has_unique_9digits)]
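A quick way to sanity-check the regex route on a hypothetical sample (values invented for illustration, not the asker's data):

import pandas as pd

df = pd.DataFrame({'Numbers': [165437890, 122234543, 2321434256, 987654321]})
mask = df['Numbers'].astype(str).str.match(r'^(?!.*(.).*\1)[0-9]{9}$')
print(df[mask])  # keeps only 165437890 and 987654321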

How to compare rows of two different dataframes

I have 2 dataframes (df and df_flagMax) that are not the same size, and I need help structuring a row-by-row comparison between them.
df = pd.read_excel('df.xlsx')
df_flagMax = df.groupby(['Name'], as_index=False)['Max'].max()
df['flagMax'] = 0
num = len(df)
for i in range(num):
    colMax = df.at[i, 'Name']
    df['flagMax'][(df['Max'] == colMax)] = 1
print(df)
df_flagMax data:
  Name    Max
0   Sf  39.91
1   Th -25.74
df data:
For example: I want to compare 'Sf' from both df and df_flagMax and then perform this line:
df['flagMax'][(df['Max'] == colMax)] = 1
if and only if 'Sf' is in both dataframes at the same row index. The same goes for the next Name value, 'Th'.
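One common structure for this kind of comparison is to merge the group-wise max back onto df and then flag exact matches. This is only a sketch under the assumption that both frames share the 'Name' column, not a confirmed answer to the question:

import pandas as pd

# bring each Name's group max alongside every row of df
merged = df.merge(df_flagMax, on='Name', how='left', suffixes=('', '_grp'))
# flag the rows whose Max equals their group's max
df['flagMax'] = (merged['Max'] == merged['Max_grp']).astype(int).values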

what can I do to make long to wide format in python

I have this long data. I would like to split it into groups of 30 and save each group separately.
The data prints like this:
A292340
A291630
A278240
A267770
A267490
A261250
A261110
A253150
A252400
A253250
A243890
A243880
A236350
A233740
A233160
A225800
A225060
A225050
A225040
A225130
A219900
A204450
A204480
A204420
A196030
A196220
A167860
A152500
A123320
A122630
...
This is a fairly simple question, but I need your help.
Thank you.
(And how can I make a list out of the printed results? List addition?)
I believe you need to create a MultiIndex from np.arange over the length of the DataFrame, using modulo and floor division by N, and then unstack.
If the length is not an exact multiple of N (e.g. 30 % 12 != 0), the last values do not fill the last column and Nones are added:
N = 12
r = np.arange(len(df))
df.index = [r % N, r // N]
df = df['col'].unstack()
print (df)
          0        1        2
0   A292340  A236350  A196030
1   A291630  A233740  A196220
2   A278240  A233160  A167860
3   A267770  A225800  A152500
4   A267490  A225060  A123320
5   A261250  A225050  A122630
6   A261110  A225040     None
7   A253150  A225130     None
8   A252400  A219900     None
9   A253250  A204450     None
10  A243890  A204480     None
11  A243880  A204420     None
Setup:
d = {'col': ['A292340', 'A291630', 'A278240', 'A267770', 'A267490', 'A261250', 'A261110', 'A253150', 'A252400', 'A253250', 'A243890', 'A243880', 'A236350', 'A233740', 'A233160', 'A225800', 'A225060', 'A225050', 'A225040', 'A225130', 'A219900', 'A204450', 'A204480', 'A204420', 'A196030', 'A196220', 'A167860', 'A152500', 'A123320', 'A122630']}
df = pd.DataFrame(d)
print (df.head())
       col
0  A292340
1  A291630
2  A278240
3  A267770
4  A267490
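Since the asker also wants to save each group separately, a small follow-up sketch (file names are hypothetical) writes each unstacked column, i.e. each chunk of up to 30 values, to its own CSV:

# each column of the unstacked frame is one chunk
for col in df.columns:
    df[col].dropna().to_csv(f'chunk_{col}.csv', index=False, header=False)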
If you don't have Pandas and Numpy modules you can use this:
Setup:
long_list = ['A292340', 'A291630', 'A278240', 'A267770', 'A267490', 'A261250', 'A261110', 'A253150', 'A252400',
'A253250', 'A243890', 'A243880', 'A236350', 'A233740', 'A233160', 'A225800', 'A225060', 'A225050',
'A225040', 'A225130', 'A219900', 'A204450', 'A204480', 'A204420', 'A196030', 'A196220', 'A167860',
'A152500', 'A123320', 'A122630', 'A292340', 'A291630', 'A278240', 'A267770', 'A267490', 'A261250',
'A261110', 'A253150', 'A252400', 'A253250', 'A243890', 'A243880', 'A236350', 'A233740', 'A233160',
'A225800', 'A225060', 'A225050', 'A225040', 'A225130', 'A219900', 'A204450', 'A204480', 'A204420',
'A196030', 'A196220', 'A167860', 'A152500', 'A123320', 'A122630']
Code:
number_elements_in_sublist = 30
sublists = []
sublists.append([])
sublist_index = 0
for index, element in enumerate(long_list):
    sublists[sublist_index].append(element)
    if index > 0:
        if (index + 1) % number_elements_in_sublist == 0:
            if index == len(long_list) - 1:
                break
            sublists.append([])
            sublist_index += 1

for index, sublist in enumerate(sublists):
    print("Sublist Nr." + str(index + 1))
    for element in sublist:
        print(element)
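The same chunking can be written more compactly with list slicing; a minimal equivalent sketch:

N = 30
sublists = [long_list[i:i + N] for i in range(0, len(long_list), N)]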

Delimiting contiguous regions with values above a certain threshold in Pandas DataFrame

I have a Pandas Dataframe of indices and values between 0 and 1, something like this:
6 0.047033
7 0.047650
8 0.054067
9 0.064767
10 0.073183
11 0.077950
I would like to retrieve tuples of the start and end points of regions of more than 5 consecutive values that are all over a certain threshold (e.g. 0.5). So that I would have something like this:
[(150, 185), (632, 680), (1500,1870)]
Where the first tuple is of a region that starts at index 150, has 35 values in a row that are all above 0.5, and ends at index 185, non-inclusive.
I started by filtering for only values above 0.5 like so
df = df[df['values'] >= 0.5]
And now I have values like this:
632 0.545700
633 0.574983
634 0.572083
635 0.595500
636 0.632033
637 0.657617
638 0.643300
639 0.646283
I can't show my actual dataset, but the following one should be a good representation.
import numpy as np
from pandas import *
np.random.seed(seed=901212)
df = DataFrame(range(1,501), columns=['indices'])
df['values'] = np.random.rand(500)*.5 + .35
yielding:
1 0.491233
2 0.538596
3 0.516740
4 0.381134
5 0.670157
6 0.846366
7 0.495554
8 0.436044
9 0.695597
10 0.826591
...
Where the region (2,4) has two values above 0.5; this would be too short. On the other hand, the region (25,44), with 19 values above 0.5 in a row, would be added to the list.
You can find the first and last element of each consecutive region by comparing the series with its 1-row-shifted values, and then filter the pairs which are adequately far apart:
# tag rows based on the threshold
df['tag'] = df['values'] > .5
# first row is a True preceded by a False
fst = df.index[df['tag'] & ~ df['tag'].shift(1).fillna(False)]
# last row is a True followed by a False
lst = df.index[df['tag'] & ~ df['tag'].shift(-1).fillna(False)]
# filter those which are adequately apart
pr = [(i, j) for i, j in zip(fst, lst) if j > i + 4]
so for example the first region would be:
>>> i, j = pr[0]
>>> df.loc[i:j]
    indices    values   tag
15       16  0.639992  True
16       17  0.593427  True
17       18  0.810888  True
18       19  0.596243  True
19       20  0.812684  True
20       21  0.617945  True
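Another idiom for the same job, a sketch not taken from either answer: label each run of consecutive tags with a cumulative sum of changes, then group by that label and keep the long runs:

# the run label increments every time the tag flips
tag = df['values'] > 0.5
run_id = (tag != tag.shift()).cumsum()
pr = [(g.index[0], g.index[-1])
      for _, g in df[tag].groupby(run_id[tag])
      if len(g) > 5]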
I think this prints what you want. It is based heavily on Joe Kington's answer here, so I guess it is appropriate to up-vote that.
import numpy as np

# from Joe Kington's answer here https://stackoverflow.com/a/4495197/3751373
# with minor edits
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""
    # Find the indices of changes in "condition"
    d = np.diff(condition, n=1, axis=0)
    idx, _ = d.nonzero()
    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right. -JK
    # LB: this copy-to-increment is horrible, but without it I get
    # ValueError: output array is read-only
    mutable_idx = np.array(idx)
    mutable_idx += 1
    idx = mutable_idx
    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]
    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size]  # Edit
    # Reshape the result into two columns
    idx.shape = (-1, 2)
    return idx
def main():
    import pandas as pd
    RUN_LENGTH_THRESHOLD = 5
    VALUE_THRESHOLD = 0.5
    np.random.seed(seed=901212)
    data = np.random.rand(500)*.5 + .35
    df = pd.DataFrame(data=data, columns=['values'])
    match_bools = df.values > VALUE_THRESHOLD
    print('with boolean array')
    for start, stop in contiguous_regions(match_bools):
        if stop - start > RUN_LENGTH_THRESHOLD:
            print(start, stop)

if __name__ == '__main__':
    main()
I would be surprised if there were not more elegant ways.
