Start index (under Field) from 1 with pandas DataFrame - Python

I would like to start the index from 1 under the "Field" column:
df = pd.DataFrame(list(zip(total_points, passing_percentage)),
                  columns=['Pts Measured', '% pass'])
df = df.rename_axis('Field').reset_index()
df["Comments"] = ""
df
Output:
   Field  Pts Measured  % pass Comments
0      0         92909   90.66
1      1         92830   91.85
2      2        130714   99.99

I found a similar question here: In Python pandas, start row index from 1 instead of zero without creating additional column
For your question, it is as simple as adding the following line (with import numpy as np at the top):
df["Field"] = np.arange(1, len(df) + 1)

Related

Run functions over many dataframes, add results to another dataframe, and dynamically name the resulting column with the name of the original df

I have many different tables that all have different column names and each refer to an outcome, like glucose, insulin, leptin etc (except keep in mind that the tables are all gigantic and messy with tons of other columns in them as well).
I am trying to generate a report that starts empty but then adds columns based on functions applied to each of the glucose, insulin, and leptin tables.
I have included a very simple example - ignore that the function makes little sense. The code below works, but instead of copy-pasting final_report["outcome"] = ... over and over again, I would like to run the find_result function over each of glucose, insulin, and leptin and add the "glucose_result", "insulin_result" and "leptin_result" columns to final_report in one or a few lines.
Thanks in advance.
import pandas as pd

ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]

glucose = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
insulin = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
leptin = pd.DataFrame({'id': ids,
                       'timepoint': timepoint,
                       'outcome': outcome})

ids = [1,2,3,4]
start = [1,1,1,1]
end = [6,6,6,6]
final_report = pd.DataFrame({'id': ids,
                             'start': start,
                             'end': end})

def find_result(subject, start, end, df):
    df = df.loc[(df["id"] == subject) & (df["timepoint"] >= start) & (df["timepoint"] <= end)].sort_values(by="timepoint")
    return df["timepoint"].nunique()

final_report['glucose_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], glucose), axis=1)
final_report['insulin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], insulin), axis=1)
final_report['leptin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], leptin), axis=1)
If you have to use this code structure, you can create a simple dictionary with your dataframes and their names and loop through them, creating new columns with programmatically assigned names:
input_dfs = {"glucose": glucose, "insulin": insulin, "leptin": leptin}
for name, df in input_dfs.items():
    final_report[f"{name}_result"] = final_report.apply(
        lambda x: find_result(x['id'], x['start'], x['end'], df),
        axis=1
    )
Output:
id start end glucose_result insulin_result leptin_result
0 1 1 6 6 6 6
1 2 1 6 6 6 6
2 3 1 6 3 3 3
3 4 1 6 6 6 6
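The same loop can also be folded into a single DataFrame.assign call with a dictionary comprehension - just a sketch of the same idea under the same names, not a change in behavior (the df=df default pins each dataframe to its own lambda):

final_report = final_report.assign(
    **{
        f"{name}_result": final_report.apply(
            lambda x, df=df: find_result(x['id'], x['start'], x['end'], df),
            axis=1
        )
        for name, df in input_dfs.items()
    }
)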

Carry out operation on pandas column using an IF statement

I have a pandas dataframe,
df = pd.DataFrame({"Id": [77000581079,77000458432,77000458433,77000458434,77000691973], "Code": ['FO07930', 'FO73597','FO03177','FO73596','FOZZZZZ']})
I want to check the value of each row in column Code to see if it matches the string 'FOZZZZZ'.
If it does not match, I would like to concatenate the Id value to the Code value.
So the expected output will be:
Id Code
0 77000581079 FO0793077000581079
1 77000458432 FO7359777000458432
2 77000458433 FO0317777000458433
3 77000458434 FO7359677000458434
4 77000691973 FOZZZZZ
I've tried
df['Id'] = df['Id'].astype(str)
for x in df['Id']:
    if x == 'FOZZZZZ':
        pass
    else:
        df['Id']+df['Code']
which I thought would run over each row to check whether it equals 'FOZZZZZ' and, if not, concatenate the columns - but no joy.
The boolean mask in .loc restricts the assignment to the rows whose Code is not 'FOZZZZZ':
df.loc[df['Code']!='FOZZZZZ', 'Code'] = df['Code'] + df['Id'].astype(str)
Use pandas.Series.where with eq:
s = df["Code"]
df["Code"] = s.where(s.eq("FOZZZZZ"), s + df["Id"].astype(str))
print(df)
Output:
Code Id
0 FO0793077000581079 77000581079
1 FO7359777000458432 77000458432
2 FO0317777000458433 77000458433
3 FO7359677000458434 77000458434
4 FOZZZZZ 77000691973
Try np.where(condition, value if condition is true, value if condition is false). Use .isin() to check whether 'FOZZZZZ' exists and negate with ~ to build the boolean condition (this needs import numpy as np):
df['Code'] = np.where(~df['Code'].isin(['FOZZZZZ']), df.Id.astype(str)+df.Code, df.Code)
Id Code
0 77000581079 77000581079FO07930
1 77000458432 77000458432FO73597
2 77000458433 77000458433FO03177
3 77000458434 77000458434FO73596
4 77000691973 FOZZZZZ
Or you could try using loc:
df['Code'] = df['Code'] + df['Id'].astype(str)
df.loc[df['Code'].str.contains('FOZZZZZ'), 'Code'] = 'FOZZZZZ'
print(df)
Output:
Code Id
0 FO0793077000581079 77000581079
1 FO7359777000458432 77000458432
2 FO0317777000458433 77000458433
3 FO7359677000458434 77000458434
4 FOZZZZZ 77000691973
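One more equivalent one-liner, sketched on the same df: Series.mask replaces values wherever the condition holds, so appending Id to every Code other than 'FOZZZZZ' looks like this.
df["Code"] = df["Code"].mask(df["Code"].ne("FOZZZZZ"), df["Code"] + df["Id"].astype(str))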

Python dataframe, move record to another column if it contains specific value

I have the following data:
For example, in row 2 I want to move all the "3:xxx" values to column 3 and all the "4:xxx" values to column 4. How can I do that?
By the way, I have tried this but it doesn't work:
df[3] = np.where((df[2].str.contains('3:')))
Dataset loading:
url = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale'
df = pd.read_csv(url,header=None,delim_whitespace=True)
I think the easiest thing to do would be to cleanse the data set before reading it into a dataframe. Looking at the data source, there are some rows with missing fields, e.g.:
# (missing the 3's field)
'1 1:-0.611111 2:0.166667508 4:-0.916667'
So I would clean up the file before reading it. For this line, you could stick an extra space between 2:0.166667508 and 4:-0.916667 to denote a null 3rd column:
'1 1:-0.611111 2:0.166667508 4:-0.916667 '.split(' ')
# ['1', '1:-0.611111', '2:0.166667508', '4:-0.916667', '']
'1 1:-0.611111 2:0.166667508  4:-0.916667 '.split(' ')
# ['1', '1:-0.611111', '2:0.166667508', '', '4:-0.916667', '']
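A sketch of that cleanup pass, assuming the libsvm-style layout above ('label 1:v 2:v ...') and a hypothetical local copy of the file named iris.scale: insert an empty field wherever an 'N:' prefix skips a column, then hand the repaired lines to pandas.

import io
import pandas as pd

repaired = []
with open('iris.scale') as fh:  # hypothetical local copy of the dataset
    for line in fh:
        fields = line.split()
        row = [fields[0]] + [''] * 4  # label plus four feature slots
        for cell in fields[1:]:
            row[int(cell.split(':')[0])] = cell  # 'N:value' goes to slot N
        repaired.append(','.join(row))

df = pd.read_csv(io.StringIO('\n'.join(repaired)), header=None)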
I agree with Greg's suggestion of cleansing the data set before reading it into a dataframe, but if you still want to shift misplaced values into their matching columns, you can try the approach below.
input.csv
1,1:-0.55,2:0.25,3:-0.86,4:-91
1,1:-0.57,2:0.26,3:-0.87,4:-0.92
1,1:-0.57,3:-0.89,4:-0.93,NaN
1,1:-0.58,2:0.25,3:-0.88,4:-0.99
Code to shift values at a particular index:
import pandas as pd

df = pd.read_csv('files/60009536-input.csv')
print(df)

for col_num in df.columns:
    if col_num > '0':  # assuming there is no problem in the index column '0'
        for row_val in df[col_num]:
            if str(row_val) != 'nan':  # skip missing cells
                if col_num != row_val[:1]:  # compare the column name with the value's 'N:' prefix
                    row = df[df[col_num] == row_val].index.values  # row index; the column is already known
                    print("Found at column {0} and row {1}".format(col_num, row))
                    r_value = df.loc[row, str(row_val[:1])].values  # value currently at the target location
                    print("target location value", r_value)
                    df.loc[row, str(r_value[0][:1])] = r_value  # shift the target location's value to its correct column
                    df.loc[row, str(row_val[:1])] = row_val  # move the misplaced value to its matching column
                    df.loc[row, col_num] = 'NaN'  # finally, blank out the original cell
print(df)
output:
0 1 2 3 4
0 1 1:-0.55 2:0.25 3:-0.86 4:-91
1 1 1:-0.57 2:0.26 3:-0.87 4:-0.92
2 1 1:-0.57 3:-0.89 4:-0.93 NaN
3 1 1:-0.58 2:0.25 3:-0.88 4:-0.99
Found at column 2 and row [2]
target location value ['4:-0.93']
0 1 2 3 4
0 1 1:-0.55 2:0.25 3:-0.86 4:-91
1 1 1:-0.57 2:0.26 3:-0.87 4:-0.92
2 1 1:-0.57 NaN 3:-0.89 4:-0.93
3 1 1:-0.58 2:0.25 3:-0.88 4:-0.99
Process finished with exit code 0
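If the goal is simply to land every 'N:value' cell in column N, a shorter row-wise sketch (assuming the same df and string column names '0' through '4' as above) may do:

def realign(row):
    out = {'0': row['0']}  # keep the label column as-is
    for cell in row.drop('0').dropna():
        out[cell.split(':')[0]] = cell  # 'N:value' goes to column 'N'
    return pd.Series(out)

df = df.apply(realign, axis=1).reindex(columns=df.columns)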

Create a dataframe to detail information of another dataframe

I have one dataframe with the value, the number of payments, and the start date. I'd like to create a new dataframe with all the payments, one row per month.
Can you give me a tip on how to finish it?
# Import pandas library
import pandas as pd
# initialize list of lists
data = [[1,'2017-06-09',300,3]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['ID','DATE','VALUE','PAYMENTS'])
# print dataframe.
df
Existing dataframe fields:
Desired dataframe (split the payments into one row each and update the date):
My first thought was to make a loop appending the payments. If, in that loop, I could also fill in the other fields and generate the new dataframe, the task would be done.
result = []
for value in df["PAYMENTS"]:
    if value == 1:
        result.append(1)
    elif value == 3:
        for x in range(1,4):
            result.append(x)
    else:
        for x in range(1,7):
            result.append(x)
Here's my try:
df.VALUE = df.VALUE / df.PAYMENTS
df = df.merge(df.ID.repeat(df.PAYMENTS), on='ID', how='outer')
df.PAYMENTS = df.groupby('ID').cumcount() + 1
Output:
ID DATE VALUE PAYMENTS
0 1 2017-06-09 100.0 1
1 1 2017-06-09 100.0 2
2 1 2017-06-09 100.0 3
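The question also mentions updating the date. A sketch that advances DATE by one month per installment, assuming the merged frame above where PAYMENTS now holds the installment number:

df['DATE'] = pd.to_datetime(df['DATE'])
df['DATE'] = df.apply(lambda r: r['DATE'] + pd.DateOffset(months=r['PAYMENTS'] - 1), axis=1)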

Create a new column in a dataframe with increment number based on another column

Consider the below pandas DataFrame:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({
    'day': [Timestamp('2017-03-27'),
            Timestamp('2017-03-27'),
            Timestamp('2017-04-01'),
            Timestamp('2017-04-03'),
            Timestamp('2017-04-06'),
            Timestamp('2017-04-07'),
            Timestamp('2017-04-11'),
            Timestamp('2017-05-01'),
            Timestamp('2017-05-01')],
    'act_id': ['916298883',
               '916806776',
               '923496071',
               '926539428',
               '930641527',
               '931935227',
               '937765185',
               '966163233',
               '966417205']
})
As you may see, there are 9 unique ids distributed in 7 days.
I am looking for a way to add two new columns.
The first column:
An increment number for each new day. For example, 1 for '2017-03-27' (same number for the same day), 2 for '2017-04-01', 3 for '2017-04-03', etc.
The second column:
An increment number for each new act_id per day. For example 1 for '916298883', 2 for '916806776' (which is linked to the same day '2017-03-27'), 1 for '923496071', 1 for '926539428', etc.
The final table should look like this:
I have already tried to build the first column with apply and a helper function, but it doesn't work as it should.
# Helper function that hands out an incrementing index number
counter = 1
def giveFlag(x):
    global counter
    index = counter
    counter += 1
    return index
And then:
# Create day flagger column
df_helper['day_no'] = df_helper['day'].apply(lambda x: giveFlag(x))
try this:
days = list(set(df['day']))
days.sort()
day_no = list()
iter_no = list()
for index, day in enumerate(days):
    counter = 1
    for dfday in df['day']:
        if dfday == day:
            iter_no.append(counter)
            day_no.append(index + 1)
            counter += 1
# note: the positional assignment below assumes df is already sorted by 'day'
df['day_no'] = pd.Series(day_no).values
df['iter_no'] = pd.Series(iter_no).values
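A shorter vectorized sketch of the same two columns, assuming df as built above: a dense rank of the days gives day_no, and a cumulative count within each day gives iter_no.

df['day_no'] = df['day'].rank(method='dense').astype(int)
df['iter_no'] = df.groupby('day').cumcount() + 1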
