I have been trying to update column values in a data frame based on another column that contains strings.
import pandas as pd
import numpy as np
1. df = pd.read_excel('C:\\Users\\bahlrajesh23\\datascience\\Invoice.xlsx')
2. df1 = df[df['Vendor'].str.contains('holding')]
3. df['cat'] = np.where(df['Vendor'].str.contains('holding'), "Yes", '')
4. print(df[0:5])
The code up to line 4 above works well, but now I want to add more conditions to line 3, so I amended line 3 above like this:
df['cat'] = np.where((df['Vendor'].str.contains('holding'), "Yes", ''),
                     (df['Vendor'].str.contains('tech'), "tech", ''))
I am getting the following error:
ValueError: either both or neither of x and y should be given
How can I achieve this?
Because you want to return a different answer for each condition, a single np.where() call won't work here. map() would also be difficult.
You can use apply() and make the function as complex as you need.
df = pd.DataFrame({'Vendor': ['techi', 'tech', 'a', 'hold', 'holding', 'holdingon', 'techno', 'b']})
def add_cat(x):
    if 'tech' in x:
        return 'tech'
    elif 'holding' in x:
        return 'Yes'
    else:
        return ''
df['cat'] = df['Vendor'].apply(add_cat)
Vendor cat
0 techi tech
1 tech tech
2 a
3 hold
4 holding Yes
5 holdingon Yes
6 techno tech
7 b
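If you would rather stay vectorized, np.select is worth a look: it takes a list of boolean conditions and a matching list of choices, with the first matching condition winning. A minimal sketch on the same example data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Vendor': ['techi', 'tech', 'a', 'hold', 'holding', 'holdingon', 'techno', 'b']})

conditions = [
    df['Vendor'].str.contains('tech'),
    df['Vendor'].str.contains('holding'),
]
choices = ['tech', 'Yes']

# the first matching condition wins; default is used when none match
df['cat'] = np.select(conditions, choices, default='')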
I have many different tables that all have different column names, and each refers to an outcome, like glucose, insulin, leptin etc. (keep in mind that the tables are all gigantic and messy, with tons of other columns in them as well).
I am trying to generate a report that starts empty but then adds columns based on functions applied to each of the glucose, insulin, and leptin tables.
I have included a very simple example - ignore that the function makes little sense. The below code works, but instead of copy + pasting final_report["outcome"] = over and over again, I would like to just run the find_result function over each of glucose, insulin, and leptin and add the "glucose_result", "insulin_result" and "leptin_result" columns to the final_report in one or a few lines.
Thanks in advance.
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
glucose = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
insulin = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
leptin = pd.DataFrame({'id': ids,
                       'timepoint': timepoint,
                       'outcome': outcome})
ids = [1,2,3,4]
start = [1,1,1,1]
end = [6,6,6,6]
final_report = pd.DataFrame({'id': ids,
                             'start': start,
                             'end': end})
def find_result(subject, start, end, df):
    df = df.loc[(df["id"] == subject) & (df["timepoint"] >= start) & (df["timepoint"] <= end)].sort_values(by="timepoint")
    return df["timepoint"].nunique()
final_report['glucose_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], glucose), axis=1)
final_report['insulin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], insulin), axis=1)
final_report['leptin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], leptin), axis=1)
If you have to use this code structure, you can create a simple dictionary with your dataframes and their names and loop through them, creating new columns with programmatically assigned names:
input_dfs = {"glucose": glucose, "insulin": insulin, "leptin": leptin}
for name, df in input_dfs.items():
    final_report[f"{name}_result"] = final_report.apply(
        lambda x: find_result(x['id'], x['start'], x['end'], df),
        axis=1
    )
Output:
id start end glucose_result insulin_result leptin_result
0 1 1 6 6 6 6
1 2 1 6 6 6 6
2 3 1 6 3 3 3
3 4 1 6 6 6 6
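If the row-wise apply() ever becomes a bottleneck on the full data, here is a rough merge-based sketch (reusing the same input_dfs dictionary) that attaches each subject's start/end window to the measurements, filters, and counts unique timepoints per id in one pass:

for name, df in input_dfs.items():
    counts = (
        df.merge(final_report[['id', 'start', 'end']], on='id')
          .query('start <= timepoint <= end')
          .groupby('id')['timepoint']
          .nunique()
    )
    # subjects with no rows in their window get a count of 0
    final_report[f'{name}_result'] = final_report['id'].map(counts).fillna(0).astype(int)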
I am new to data science and your help is appreciated. My question is about grouping a dataframe by columns so that a bar chart can be plotted based on each subject's status.
My csv file is something like this:
Name,Maths,Science,English,sports
S1,Pass,Fail,Pass,Pass
S2,Pass,Pass,NA,Pass
S3,Pass,Fail,Pass,Pass
S4,Pass,Pass,Pass,NA
S5,Pass,Fail,Pass,NA
Expected output:
Subject,Status,Count
Maths,Pass,5
Science,Pass,2
Science,Fail,3
English,Pass,4
English,NA,1
Sports,Pass,3
Sports,NA,2
You can do this with pandas, not exactly in the same output format as in the question, but definitely having the same information:
import pandas as pd
# reading csv
df = pd.read_csv("input.csv")
# turning columns into rows
melt_df = pd.melt(df, id_vars=['Name'],
                  value_vars=['Maths', 'Science', 'English', 'sports'],
                  var_name="Subject", value_name="Status")
# filling NaN values, otherwise the below groupby will ignore them.
melt_df = melt_df.fillna("Unknown")
# counting per group of subject and status.
result_df = melt_df.groupby(["Subject", "Status"]).size().reset_index(name="Count")
Then you get the following result:
Subject Status Count
0 English Pass 4
1 English Unknown 1
2 Maths Pass 5
3 Science Fail 3
4 Science Pass 2
5 sports Pass 3
6 sports Unknown 2
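A note on the fillna step above: read_csv converts the literal string "NA" in the file to NaN by default. If you would rather keep "NA" as a real status value (matching the expected output exactly), you can disable that conversion when reading:

# keep the literal "NA" strings instead of turning them into NaN,
# so the melt/groupby above reproduces the expected output directly
df = pd.read_csv("input.csv", keep_default_na=False)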
PS: Going forward, always include the code you've tried so far.
To match exactly your output, this is what you could do:
import pandas as pd
df = pd.read_csv('c:/temp/data.csv')  # or wherever your csv file is
subjects = ['Maths', 'Science', 'English', 'sports']  # or you could take df.columns and drop 'Name'
grouped_rows = []
for eachsub in subjects:
    rows = df.groupby(eachsub)['Name'].count()
    idx = list(rows.index)
    if 'Pass' in idx:
        grouped_rows.append([eachsub, 'Pass', rows['Pass']])
    if 'Fail' in idx:
        grouped_rows.append([eachsub, 'Fail', rows['Fail']])
new_df = pd.DataFrame(grouped_rows, columns=['Subject', 'Grade', 'Count'])
print(new_df)
I must suggest, though, that I would avoid the for loop. My approach would be just these two lines:

subjects = ['Maths', 'Science', 'English', 'sports']
grouped_rows = df[subjects].apply(pd.Series.value_counts)

Depending on your application, you then already have all the counts available in grouped_rows (one column per subject, one row per status).
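Since the end goal is a bar chart per subject and status, here is a minimal plotting sketch building on result_df from the melt/groupby answer above (assumes matplotlib is installed):

import matplotlib.pyplot as plt

# one group of bars per subject, one bar per status
(result_df.pivot(index='Subject', columns='Status', values='Count')
          .plot.bar())
plt.tight_layout()
plt.show()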
I have a CSV file that looks like below; this is similar to my last question, but this time using Pandas.
Group Sam Dan Bori Son John Mave
A 0.00258844 0.983322 1.61479 1.2785 1.96963 10.6945
B 0.0026034 0.983305 1.61198 1.26239 1.9742 10.6838
C 0.0026174 0.983294 1.60913 1.24543 1.97877 10.6729
D 0.00263062 0.983289 1.60624 1.22758 1.98334 10.6618
E 0.00264304 0.98329 1.60332 1.20885 1.98791 10.6505
I have a function like the one below:
def getnewno(value):
    value = value + 30
    if value > 40:
        value = value - 20
    else:
        value = value
    return value
I want to send all these values to the getnewno function, get a new value back, and update the CSV file. How can this be accomplished in Pandas?
Expected output:
Group Sam Dan Bori Son John Mave
A 30.00258844 30.983322 31.61479 31.2785 31.96963 20.6945
B 30.0026034 30.983305 31.61198 31.26239 31.9742 20.6838
C 30.0026174 30.983294 31.60913 31.24543 31.97877 20.6729
D 30.00263062 30.983289 31.60624 31.22758 31.98334 20.6618
E 30.00264304 30.98329 31.60332 31.20885 31.98791 20.6505
The following should give you what you desire.
Applying a function
Your function can be simplified and here expressed as a lambda function.
It's then a matter of applying your function to all of the columns. There are a number of ways to do so. The first idea that comes to mind is to loop over df.columns. However, we can do better than this by using the applymap or transform methods:
import pandas as pd
# Read in the data from file
df = pd.read_csv('data.csv',
                 sep=r'\s+',
                 index_col=0)
# Simplified function with which to transform data
getnewno = lambda value: value + 10 if value > 10 else value + 30
# Looping over columns
#for col in df.columns:
#    df[col] = df[col].apply(getnewno)
# Apply to all columns without loop
df = df.applymap(getnewno)
# Write out updated data
df.to_csv('data_updated.csv')
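A small version note: on pandas 2.1 or newer, applymap is deprecated in favour of the identically behaving DataFrame.map, so there the equivalent call would be:

# pandas >= 2.1: DataFrame.map is the new name for applymap
df = df.map(getnewno)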
Using broadcasting
You can achieve your result using broadcasting and a little boolean logic. This avoids looping over any columns, and should ultimately prove faster and less memory intensive (although if your dataset is small any speed-up would be negligible):
import pandas as pd
df = pd.read_csv('data.csv',
                 sep=r'\s+',
                 index_col=0)
df += 30
make_smaller = df > 40
df[make_smaller] -= 20
First of all, your getnewno function looks too complicated... it can be simplified to e.g.:
def getnewno(value):
    if value + 30 > 40:
        return value + 10
    else:
        return value + 30
You can even change value + 30 > 40 to value > 10.
Or even a oneliner if you want:
getnewno = lambda value: value + 10 if value > 10 else value + 30
Having the function, you can apply it to specific values/columns. For example, if you want to create a column Mark_updated based on a Mark column, it would look like this (I assume your pandas DataFrame is called df):
df['Mark_updated'] = df['Mark'].apply(getnewno)
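To update every data column at once with this function, a sketch (assuming 'Group' was read as a regular column rather than the index):

# apply the scalar function to each value of every column except 'Group'
num_cols = df.columns.drop('Group')
df[num_cols] = df[num_cols].apply(lambda col: col.apply(getnewno))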
Use the mask function to do an if-else solution before writing the data to csv:
res = (df
       .select_dtypes('number')
       .add(30)
       # the if-else comes in here:
       # if any entry in the dataframe is greater than 40, subtract 20 from it,
       # else leave it as is
       .mask(lambda x: x > 40, lambda x: x.sub(20))
      )

# insert the group column back
res.insert(0, 'Group', df.Group.array)

# write to csv
res.to_csv(filename)
Group Sam Dan Bori Son John Mave
0 A 30.002588 30.983322 31.61479 31.27850 31.96963 20.6945
1 B 30.002603 30.983305 31.61198 31.26239 31.97420 20.6838
2 C 30.002617 30.983294 31.60913 31.24543 31.97877 20.6729
3 D 30.002631 30.983289 31.60624 31.22758 31.98334 20.6618
4 E 30.002643 30.983290 31.60332 31.20885 31.98791 20.6505
I already asked a similar question and was able to piece some more of it together, but I need more help: Determining how one date/time range overlaps with the second date/time range?
I want to be able to check when two date ranges, each with a start and end date/time, overlap. type2 has about 50 rows while type1 has over 500. I want to take the start and end of each type2 row and see if it falls within a type1 range. Here is a snip of the data; the dates do change down the list from 2019-04-01 onwards to the following days.
type1 type1_start type1_end
a 2019-04-01T00:43:18.046Z 2019-04-01T00:51:35.013Z
b 2019-04-01T02:16:46.490Z 2019-04-01T02:23:23.887Z
c 2019-04-01T03:49:31.981Z 2019-04-01T03:55:16.153Z
d 2019-04-01T05:21:22.131Z 2019-04-01T05:28:05.469Z
type2 type2_start type2_end
1 2019-04-01T00:35:12.061Z 2019-04-01T00:37:00.783Z
2 2019-04-02T00:37:15.077Z 2019-04-02T00:39:01.393Z
3 2019-04-03T00:39:18.268Z 2019-04-03T00:41:01.844Z
4 2019-04-04T00:41:21.576Z 2019-04-04T00:43:02.071Z
I have been googling the best way to do this and have read through Determine Whether Two Date Ranges Overlap, and I understand how it should be done, but I don't know enough about how to wire the variables together and make them work.
# Here is what I have, but I am stuck and have no clue where to go from here:
import pandas as pd
from pandas import Timestamp
import numpy as np
from collections import namedtuple
colnames = ['type1', 'type1_start', 'type1_end', 'type2', 'type2_start', 'type2_end']
data = pd.read_csv('test.csv', names=colnames, parse_dates=['type1_start', 'type1_end','type2_start', 'type2_end'])
A_start = data['type1_start']
A_end = data['type1_end']
B_start = data['type2_start']
B_end = data['type2_end']
t1 = data['type1']
t2 = data['type2']
r1 = (B_start, B_end)
r2 = (A_start, A_end)

def doesOverlap(r1, r2):
    if B_start > A_start:
        swap(r1, r2)
    if A_start > B_end:
        return False
    return True
It would be nice to have a csv with a True/False overlap result. I also got my data to run using Efficiently find overlap of date-time ranges from 2 dataframes, but the results weren't correct: I added a couple of rows that I know should overlap, and they weren't flagged. I need each type2 start/end to be checked against each type1 range.
Any help would be greatly appreciated.
Here is one way to do it:
import pandas as pd
def overlaps(row):
    # two ranges overlap exactly when each one starts before the other ends;
    # this also covers the case where one range fully contains the other
    return (row['type1_start'] < row['type2_end']
            and row['type2_start'] < row['type1_end'])
colnames = ['type1', 'type1_start', 'type1_end', 'type2', 'type2_start', 'type2_end']
df = pd.read_csv('test.csv', names=colnames,
                 parse_dates=['type1_start', 'type1_end', 'type2_start', 'type2_end'])
df['overlap'] = df.apply(overlaps, axis=1)
print('\n', df)
gives:
type1 type1_start type1_end type2 type2_start type2_end overlap
0 type1 type1_start type1_end type2 type2_start type2_end False
1 a 2019-03-01T00:43:18.046Z 2019-04-02T00:51:35.013Z 1 2019-04-01T00:35:12.061Z 2019-04-01T00:37:00.783Z True
2 b 2019-04-01T02:16:46.490Z 2019-04-01T02:23:23.887Z 2 2019-04-02T00:37:15.077Z 2019-04-02T00:39:01.393Z False
3 c 2019-04-01T03:49:31.981Z 2019-04-01T03:55:16.153Z 3 2019-04-03T00:39:18.268Z 2019-04-03T00:41:01.844Z False
4 d 2019-04-01T05:21:22.131Z 2019-04-01T05:28:05.469Z 4 2019-04-04T00:41:21.576Z 2019-04-04T00:43:02.071Z False
Below, df1 contains the type1 records and df2 contains the type2 records:
df_new = df1.assign(key=1) \
            .merge(df2.assign(key=1), on='key') \
            .assign(has_overlap=lambda x: ~((x.type2_start > x.type1_end) |
                                            (x.type2_end < x.type1_start)))
REF: Performant cartesian product (CROSS JOIN) with pandas
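On pandas 1.2 or newer the dummy key is unnecessary, since merge supports a cross join directly; the same logic then reads:

df_new = df1.merge(df2, how='cross').assign(
    has_overlap=lambda x: ~((x.type2_start > x.type1_end) |
                            (x.type2_end < x.type1_start))
)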
Hello my dear coders,
I'm new to coding and I've stumbled upon a problem. I want to split a column of a csv file that I have imported via pandas in Python. The column is named CATEGORY and contains one, two or three values separated by a comma (e.g.: 2343, 3432, 4959). Now I want to split these values into separate columns named CATEGORY, SUBCATEGORY and SUBSUBCATEGORY.
I have tried this line of code:
products_combined[['CATEGORY','SUBCATEGORY', 'SUBSUBCATEGORY']] = products_combined.pop('CATEGORY').str.split(expand=True)
But I get this error: ValueError: Columns must be same length as key
Would love to hear your feedback <3
You need:
pd.DataFrame(df.CATEGORY.str.split(',').tolist(), columns=['CATEGORY','SUBCATEGORY', 'SUBSUBCATEGORY'])
Output:
CATEGORY SUBCATEGORY SUBSUBCATEGORY
0 2343 3432 4959
1 2343 3432 4959
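If you want to keep the rest of products_combined intact and stay close to your original attempt, a sketch: split on ',' with expand=True, pad to three columns so the assignment no longer fails when some rows have fewer values, and join the result back (strip whitespace afterwards if your values have spaces after the commas):

parts = products_combined['CATEGORY'].str.split(',', expand=True).reindex(columns=range(3))
parts.columns = ['CATEGORY', 'SUBCATEGORY', 'SUBSUBCATEGORY']
products_combined = products_combined.drop(columns='CATEGORY').join(parts)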
I think this could be accomplished by splitting the column first and then creating three new columns, each assigned from a lambda function applied to the split values. Like so:

parts = products_combined['CATEGORY'].str.split(',')
products_combined['SUBCATEGORY'] = parts.apply(lambda original: original[1] if len(original) > 1 else None)
products_combined['SUBSUBCATEGORY'] = parts.apply(lambda original: original[2] if len(original) > 2 else None)
products_combined['CATEGORY'] = parts.apply(lambda original: original[0])

Note that CATEGORY is overwritten last, so the split values are still available for the other two assignments.
The apply() method called on a series returns a new series that contains the result of running the passed function (in this case, the lambda function) on each row of the original series.
IIUC, use split and then Series:
(
    df[0].apply(lambda x: pd.Series(x.split(",")))
         .rename(columns={0: "CATEGORY", 1: "SUBCATEGORY", 2: "SUBSUBCATEGORY"})
)
CATEGORY SUBCATEGORY SUBSUBCATEGORY
0 2343 3432 4959
1 1 NaN NaN
2 44 55 NaN
Data:
d = [["2343,3432,4959"],["1"],["44,55"]]
df = pd.DataFrame(d)