Pandas change column value if greater than len - python

I am trying to use an if statement to change values within a column if their length is greater than x.
My CSV data:
ID Test_Case TC_NUM
14581,dialog_testcase_4000.0134_mvp_not_understood-inprogress.xml,4000.0134
14582,dialog_testcase_4000.0135_mvp_not_understood-inprogress.xml,4000.0135
14583,dialog_testcase_4000.0136_mvp_not_understood-inprogress.xml,4000.0136
14584,dialog_testcase_4000.0137_mvp_not_understood_6.2.0-inprogress.xml,4000.01376.2.0
14585,dialog_testcase_4000.0138_mvp_not_understood_6.2.0-inprogress.xml,4000.01386.2.0
What I want:
ID Test_Case TC_NUM
14581,dialog_testcase_4000.0134_mvp_not_understood-inprogress.xml,4000.0134
14582,dialog_testcase_4000.0135_mvp_not_understood-inprogress.xml,4000.0135
14583,dialog_testcase_4000.0136_mvp_not_understood-inprogress.xml,4000.0136
14584,dialog_testcase_4000.0137_mvp_not_understood_6.2.0-inprogress.xml,4000.0137
14585,dialog_testcase_4000.0138_mvp_not_understood_6.2.0-inprogress.xml,4000.0138
My current code is able to extract the right value in some rows, but messes up when additional numbers are present in the file name.
df1['TC_NUM'] = df1['TC_NUM'].str.replace(r'[^0-9.]+', '')
df1['TC_NUM'] = df1['TC_NUM'].str[:-1]
My thought/attempt at using an if statement to correct this:
if dfidtcnum(len['TC_NUM'] > 12):
    print "True"

IIUC you can use mask:
print (df.TC_NUM.str.len() > 9)
0 False
1 False
2 False
3 True
4 True
Name: TC_NUM, dtype: bool
df['TC_NUM'] = df.TC_NUM.mask(df.TC_NUM.str.len() > 9, df.TC_NUM.str[:-5])
print (df)
ID Test_Case TC_NUM
0 14581 dialog_testcase_4000.0134_mvp_not_understood-i... 4000.0134
1 14582 dialog_testcase_4000.0135_mvp_not_understood-i... 4000.0135
2 14583 dialog_testcase_4000.0136_mvp_not_understood-i... 4000.0136
3 14584 dialog_testcase_4000.0137_mvp_not_understood_6... 4000.0137
4 14585 dialog_testcase_4000.0138_mvp_not_understood_6... 4000.0138
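As a side note, a possible alternative sketch (not from the answer above): rather than stripping characters and trimming by length, pull the TC number straight out of Test_Case with str.extract, assuming the number always appears as digits.digits right after "testcase_":
# hypothetical alternative: extract the pattern directly (assumes "testcase_<digits>.<digits>" is always present)
df['TC_NUM'] = df['Test_Case'].str.extract(r'testcase_(\d+\.\d+)', expand=False)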

Related

How do I label data based on the values of the previous row?

I want to label the data "1" if the current value is higher than that of the previous row and "0" otherwise.
Let's say I have this DataFrame:
df = pd.DataFrame({'date': [1,2,3,4,5], 'price': [50.125, 45.25, 65.857, 100.956, 77.4152]})
and I want the output as if the DataFrame is constructed like this:
df = pd.DataFrame({'date': [1,2,3,4,5], 'price': [50.125, 45.25, 65.857, 100.956, 77.4152], 'label':[0, 0, 1, 1, 0]})
*I don't know how to post a DataFrame
This code is my attempt:
df['label'] = 0
i = 0
for price in df['price']:
    i = i+1
    if price in i > price: #---> right now I am stuck here. It says argument of type 'int' is not iterable
        df.append['label', 1]
    elif price in i <= price:
        df.append['label', 0]
I think there are also other logical mistakes in my code. What am I doing wrong?
Create a boolean mask with Series.ge (>=) and Series.shift, then convert the True/False values to 1/0 by casting with Series.view:
df['label'] = df['price'].ge(df['price'].shift()).view('i1')
Or by Series.astype:
df['label'] = df['price'].ge(df['price'].shift()).astype(int)
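A quick check of the comparison on the example data (the first row is compared against NaN, which evaluates to False, hence 0):
print(df['price'].ge(df['price'].shift()).astype(int).tolist())
# [0, 0, 1, 1, 0]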
IIUC, use np.where with a comparison against the shifted price column to check whether each row's price is at least that of the row above.
df['label'] = np.where(df['price'].ge(df['price'].shift()),1,0)
print(df)
date price label
0 1 50.1250 0
1 2 45.2500 0
2 3 65.8570 1
3 4 100.9560 1
4 5 77.4152 0
Explanation:
print(df['price'].ge(df['price'].shift()))
returns a boolean Series of True and False values that we can use as the condition in np.where:
0 False
1 False
2 True
3 True
4 False
To explain what is happening in your code:
The labels should be collected in a plain Python list that starts empty and is assigned to df['label'] once it is complete; setting df['label'] = 0 just fills the whole column with 0.
i is just the index value (0, 1, 2, 3...) and not the value of the price at a specific index (50.125, 45.25, 65.857...), so it is not what you want to compare.
price in checks whether the value of price exists in the container that follows. Here in isn't followed by a container, so you get an error. What you want instead is to compare the price at a given index with the price at the previous index.
The append method uses () and not [].
If you want to continue along your method of using a loop, the following can work:
labels = []
for i in range(1, len(df['price'])):  # start at 1: the first row has no previous price
    if df['price'][i] > df['price'][i - 1]:
        labels.append(1)
    else:
        labels.append(0)
df['label'] = [0] + labels  # label the first row 0 and assign the finished list as the column
What this does is loop over the positions of the price column (starting at 1, since the first row has no previous value), compare the price at position i with the price at position i - 1, and finally assign the collected labels as the new column.
You can also further simplify the if/else statement using a ternary operator to:
labels.append(1 if df['price'][i] > df['price'][i - 1] else 0)
Working fiddle: https://repl.it/repls/HoarseImpressiveScientists

Change the consecutive value of selected rows in pandas

I need to update the value of the column where it matches the selection criteria and repeat that value for some consecutive rows.
eg:
INPUT:
df = pd.DataFrame({"a" : [True,False,False,False,False,True,False,False]})
(input format)
Roll the value to the next 3 indexes.
OUTPUT:
output = pd.DataFrame({"a" : [True, True, True, False, False, True, True, True]})
(output format)
I looked at pandas.Series.repeat, but that adds new values; I need to make sure the size remains the same.
Use .rolling(...) to get rolling window:
df.rolling(window=3, min_periods=1).agg(lambda x: any(x)).astype("bool")
Output:
a
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 True
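Another possible sketch of the same result, assuming column a really is boolean: keep the True positions and forward-fill each one over the next two rows:
# mask out the False values, forward-fill at most 2 rows, then restore booleans
df['a'] = df['a'].where(df['a']).ffill(limit=2).fillna(False).astype(bool)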
You could also do this with a plain while loop, walking the index and rolling each True forward:
i = 0                                # position in the frame
while i < len(df):                   # walk through every row
    if df.loc[i, 'a']:               # found a True
        df.loc[i:i + 2, 'a'] = True  # roll it to the next 2 rows as well
        i += 3                       # skip the rows we just set
    else:
        i += 1
This way each True value is rolled forward to the following rows while the index is walked only once.
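A quick check of the loop above against the example frame from the question:
df = pd.DataFrame({"a": [True, False, False, False, False, True, False, False]})
# ... run the loop above ...
print(df['a'].tolist())
# [True, True, True, False, False, True, True, True]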

Python : switching values in panda dataframe

I just want to switch 'correct' to 'false' and 'false' to 'correct' in my pandas DataFrame; doing what I have written below changes everything to 'correct'. How do I fix this?
a.loc[(a["outcome"] == 'correct') 'outcome'] = 'false' and a.loc[(a["outcome"] == 'false'), 'outcome'] = 'correct'
Use Series.map with a dictionary, and if there are other values not covered by the dict, add fillna:
a = pd.DataFrame({'outcome':['correct','correct','false', 'val']})
print (a)
outcome
0 correct
1 correct
2 false
3 val
d = {'correct':'false', 'false':'correct'}
a['outcome'] = a['outcome'].map(d).fillna(a['outcome'])
print (a)
outcome
0 false
1 false
2 correct
3 val
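For what it's worth, Series.replace with the same dictionary is a one-line alternative sketch; values missing from the dict are left untouched, so the fillna step isn't needed:
a['outcome'] = a['outcome'].replace({'correct': 'false', 'false': 'correct'})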

what can I do to make long to wide format in python

I have this long data. I'd like to split it into groups of 30 each and save them separately.
The data prints like this:
A292340
A291630
A278240
A267770
A267490
A261250
A261110
A253150
A252400
A253250
A243890
A243880
A236350
A233740
A233160
A225800
A225060
A225050
A225040
A225130
A219900
A204450
A204480
A204420
A196030
A196220
A167860
A152500
A123320
A122630
.
This is a fairly simple question, but I need your help.
Thank you.
(And how can I make a list out of the printed results? By list addition?)
I believe you need to create a MultiIndex from np.arange over the length of the DataFrame, using modulo and floor division, and then unstack.
If the length is not an exact multiple of N (e.g. 30 % 12 != 0), the last values do not fill the last column completely and Nones are added:
N = 12
r = np.arange(len(df))
df.index = [r % N, r // N]
df = df['col'].unstack()
print (df)
0 1 2
0 A292340 A236350 A196030
1 A291630 A233740 A196220
2 A278240 A233160 A167860
3 A267770 A225800 A152500
4 A267490 A225060 A123320
5 A261250 A225050 A122630
6 A261110 A225040 None
7 A253150 A225130 None
8 A252400 A219900 None
9 A253250 A204450 None
10 A243890 A204480 None
11 A243880 A204420 None
Setup:
d = {'col': ['A292340', 'A291630', 'A278240', 'A267770', 'A267490', 'A261250', 'A261110', 'A253150', 'A252400', 'A253250', 'A243890', 'A243880', 'A236350', 'A233740', 'A233160', 'A225800', 'A225060', 'A225050', 'A225040', 'A225130', 'A219900', 'A204450', 'A204480', 'A204420', 'A196030', 'A196220', 'A167860', 'A152500', 'A123320', 'A122630']}
df = pd.DataFrame(d)
print (df.head())
col
0 A292340
1 A291630
2 A278240
3 A267770
4 A267490
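For the original goal of 30 rows per group, each saved separately, a possible sketch (assuming numpy is imported as np; the CSV file names are made up for illustration):
N = 30
for i, chunk in df.groupby(np.arange(len(df)) // N):
    chunk.to_csv('chunk_{}.csv'.format(i), index=False)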
If you don't have the pandas and NumPy modules, you can use this:
Setup:
long_list = ['A292340', 'A291630', 'A278240', 'A267770', 'A267490', 'A261250', 'A261110', 'A253150', 'A252400',
'A253250', 'A243890', 'A243880', 'A236350', 'A233740', 'A233160', 'A225800', 'A225060', 'A225050',
'A225040', 'A225130', 'A219900', 'A204450', 'A204480', 'A204420', 'A196030', 'A196220', 'A167860',
'A152500', 'A123320', 'A122630', 'A292340', 'A291630', 'A278240', 'A267770', 'A267490', 'A261250',
'A261110', 'A253150', 'A252400', 'A253250', 'A243890', 'A243880', 'A236350', 'A233740', 'A233160',
'A225800', 'A225060', 'A225050', 'A225040', 'A225130', 'A219900', 'A204450', 'A204480', 'A204420',
'A196030', 'A196220', 'A167860', 'A152500', 'A123320', 'A122630']
Code:
number_elements_in_sublist = 30
sublists = []
sublists.append([])
sublist_index = 0
for index, element in enumerate(long_list):
    sublists[sublist_index].append(element)
    if index > 0:
        if (index+1) % number_elements_in_sublist == 0:
            if index == len(long_list)-1:
                break
            sublists.append([])
            sublist_index += 1
for index, sublist in enumerate(sublists):
    print("Sublist Nr." + str(index+1))
    for element in sublist:
        print(element)
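The same chunking can also be sketched more compactly with list slicing:
n = 30
sublists = [long_list[i:i + n] for i in range(0, len(long_list), n)]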

pandas filtering consecutive rows

I have a DataFrame with a Matrix column like this:
11034-A
11034-B
1120-A
1121-A
112570-A
113-A
113.558
113.787-A
113.787-B
114-A
11691-A
11691-B
117-A RRS
12 X R
12-476-AT-A
12-476-AT-B
I'd like to keep only the Matrix values ending with A or B when they form consecutive pairs, so in the example above: 11034-A and 11034-B, 113.787-A and 113.787-B, 11691-A and 11691-B, 12-476-AT-A and 12-476-AT-B.
I wrote a function that compares those 2 strings and returns True or False; the problem is I fail to see how to apply/applymap it to consecutive rows:
def isAB(stringA, stringB):
    if stringA.endswith('A') and stringB.endswith('B') and stringA[:-1] == stringB[:-1]:
        return True
    else:
        return False
I tried df['result'] = isAB(df['Matrix'].str, df['Matrix'].shift().str) to no avail.
I seem to be missing something in the way I designed this.
Edit:
I think this works; it looks like I overcomplicated it at first:
df['t'] = (df['Matrix'].str.endswith('A') & df['Matrix'].shift(-1).str.endswith('B')) | (df['Matrix'].str.endswith('B') & df['Matrix'].shift(1).str.endswith('A'))
df['p'] = (df['Matrix'].str[:-1] == df['Matrix'].shift(-1).str[:-1]) | (df['Matrix'].str[:-1] == df['Matrix'].shift(1).str[:-1])
df['e'] = df['p'] & df['t']
final = df[df['e']]
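A compact sketch of the same idea as the edit above (same Matrix column; pair_start and final are made-up names, and shift's fill_value needs a reasonably recent pandas):
pair_start = ((df['Matrix'].str[:-1] == df['Matrix'].shift(-1).str[:-1])
              & df['Matrix'].str.endswith('A')
              & df['Matrix'].shift(-1).str.endswith('B', na=False))
final = df[pair_start | pair_start.shift(1, fill_value=False)]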
Here is how I would do it.
df['ShiftUp'] = df['matrix'].shift(-1)
df['ShiftDown'] = df['matrix'].shift()
def check_matrix(x):
    if pd.isnull(x.ShiftUp) == False and x.matrix[:-1] == x.ShiftUp[:-1]:
        return True
    elif pd.isnull(x.ShiftDown) == False and x.matrix[:-1] == x.ShiftDown[:-1]:
        return True
    else:
        return False
df['new'] = df.apply(check_matrix, axis=1)
df = df.drop(['ShiftUp', 'ShiftDown'], axis=1)
print(df)
prints
matrix new
0 11034-A True
1 11034-B True
2 1120-A False
3 1121-A False
4 112570-A False
5 113-A False
6 113.558 False
7 113.787-A True
8 113.787-B True
9 114-A False
10 11691-A True
11 11691-B True
12 117-A RRS False
13 12 X R False
14 12-476-AT-A True
15 12-476-AT-B True
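The same prefix comparison can also be written without the helper columns (a vectorized sketch, using the same matrix column name as above):
prefix = df['matrix'].str[:-1]
df['new'] = prefix.eq(prefix.shift(-1)) | prefix.eq(prefix.shift(1))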
Here's my solution, it requires a bit of work.
The strategy is the following: obtain a new column that has the same values as the current column but shifted one position.
Then it's just a matter of checking whether one value ends with A or B and the other with B or A.
Say your Matrix column is called "column_name".
Then:
myl = ['11034-A',
'11034-B',
'1120-A',
'1121-A',
'112570-A',
'113-A',
'113.558',
'113.787-A',
'113.787-B',
'114-A',
'11691-A',
'11691-B',
'117-A RRS',
'12 X R',
'12-476-AT-A',
'12-476-AT-B']
#toy data frame
mydf = pd.DataFrame.from_dict({'column_name':myl})
#get a new series which is the same one as the original
#but the first entry contains "nothing"
new_series = pd.Series(['nothing'] +
                       mydf['column_name'][:-1].values.tolist())
#add it to the original dataframe
mydf['new_col'] = new_series
You then define a simple function:
def do_i_want_this_row(x, y):
    left_char = x[-1]
    right_char = y[-1]
    return ((left_char == 'A') & (right_char == 'B')) or ((left_char == 'B') & (right_char == 'A'))
and voila:
print(mydf[mydf.apply(lambda x: do_i_want_this_row(x.column_name, x.new_col), axis=1)])
column_name new_col
1 11034-B 11034-A
2 1120-A 11034-B
8 113.787-B 113.787-A
9 114-A 113.787-B
11 11691-B 11691-A
15 12-476-AT-B 12-476-AT-A
There is still the question of the last element, but I'm sure you can think of what to do with it if you decide to follow this strategy ;)
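As a side note, the shifted helper column can also be built directly with shift (a sketch; the fill_value argument requires a reasonably recent pandas):
mydf['new_col'] = mydf['column_name'].shift(fill_value='nothing')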
You can delete rows from a DataFrame using DataFrame.drop(labels, axis). To get a list of labels to delete, I would first get a list of pairs that match your criterion. With the labels from above in a list labels and your isAB function,
pairs = zip(labels[:-1], labels[1:])
delete_pairs = [pair for pair in pairs if isAB(*pair)]
delete_labels = []
for a, b in delete_pairs:
    delete_labels.append(a)
    delete_labels.append(b)
Examine delete_labels to make sure you've put it together correctly:
print(delete_labels)
And finally, delete the rows. With the DataFrame in question as x,
x.drop(delete_labels) # or x.drop(delete_labels, axis) if appropriate
