Python DataFrame: replace multiple comma-separated values in a single row - python

I have a problem with this situation:
This is the course table.
I have this dataframe:
My expected outcome is:
I've tried using while, for loops, and if/else, but I think I've misplaced or misunderstood something in the code:
Thanks for your help!

Based on your description, I created some sample data and implemented what you are trying to do:
import pandas as pd

df = pd.DataFrame()
df['academic_id'] = ['1', '1,2,3', '3,4', '5,6']
# dictionary with the id -> course name mappings we need
info = {'1': 'course a', '2': 'course b', '3': 'course c', '4': 'course d', '5': 'course e', '6': 'course f'}

def mapper(elem):
    # split the comma-separated ids, look each one up, and re-join
    return ','.join([info[i] for i in elem.split(',')])

df['academic_id_text'] = df['academic_id'].apply(mapper)
print(df)
Output:
academic_id academic_id_text
0 1 course a
1 1,2,3 course a,course b,course c
2 3,4 course c,course d
3 5,6 course e,course f
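
If you prefer chaining pandas built-ins over a custom mapper, a minimal alternative sketch (same df and info as above; Series.explode needs pandas >= 0.25) is:

df['academic_id_text'] = (
    df['academic_id']
    .str.split(',')    # '1,2,3' -> ['1', '2', '3']
    .explode()         # one row per id, original index repeated
    .map(info)         # look each id up in the dictionary
    .groupby(level=0)  # regroup by the original row index
    .agg(','.join)
)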


Auto Increment column value is larger than I expected

When I put data into a DB with Python,
I ran into a problem where the auto-increment column value is larger than I expected.
Assume that I use the following function multiple times to put data into the DB.
'db_engine' is a DB engine whose database contains the tables 'tbl_student' and 'tbl_score'.
To manage the total number of students, the table 'tbl_student' has an auto-increment column named 'index'.
def save_in_db(db_engine, dataframe):
    # tbl_student
    student_dataframe = pd.DataFrame({
        "ID": dataframe['ID'],
        "NAME": dataframe['NAME'],
        "GRADE": dataframe['GRADE'],
    })
    student_dataframe.to_sql(name="tbl_student", con=db_engine, if_exists='append', index=False)
    # tbl_score
    score_dataframe = pd.DataFrame({
        "SCORE_MATH": dataframe['SCORE_MATH'],
        "SCORE_SCIENCE": dataframe['SCORE_SCIENCE'],
        "SCORE_HISTORY": dataframe['SCORE_HISTORY'],
    })
    score_dataframe.to_sql(name="tbl_score", con=db_engine, if_exists='append', index=False)
'tbl_student' after a few inserts looks as follows:
index  ID       NAME    GRADE
0      2023001  Amy     1
1      2023002  Brady   1
2      2023003  Caley   4
6      2023004  Dee     2
7      2023005  Emma    2
8      2023006  Favian  3
12     2023007  Grace   3
13     2023008  Harry   3
14     2023009  Ian     3
Please take a look at the 'index' column.
When I insert data several times, 'index' gets larger values than I expected.
What should I try to solve this problem?
You could try:
student_dataframe = student_dataframe.reset_index(drop=True)
Actually, the problem is that the 'index' column is connected to another table as a FOREIGN KEY.
Every time I added data, an error occurred because the key was missing (the index values are not continuous!).
I solved this by checking the index once before putting data into the DB and setting it as the key.
The following code is what I tried:
index_no = get_index(db_engine)
dataframe.index = dataframe.index + index_no + 1 - len(dataframe)
dataframe.reset_index(inplace=True)
If anyone has the same problem, it may be better to try another approach rather than trying to make the auto-increment key sequential.
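
For reference, get_index is not shown above; a minimal sketch of such a helper (my assumption of its intent, using SQLAlchemy and MySQL-style backtick quoting since 'index' is a reserved word) might look like this:

from sqlalchemy import text

def get_index(db_engine):
    # Hypothetical helper, not from the original post: return the current
    # maximum value of the 'index' column (or -1 if the table is empty).
    with db_engine.connect() as conn:
        value = conn.execute(text('SELECT MAX(`index`) FROM tbl_student')).scalar()
    return value if value is not None else -1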

group dataframe based on columns

I am new to data science; your help is appreciated. My question is about grouping a dataframe by its columns so that a bar chart can be plotted for each subject's status.
my csv file is something like this
Name,Maths,Science,English,sports
S1,Pass,Fail,Pass,Pass
S2,Pass,Pass,NA,Pass
S3,Pass,Fail,Pass,Pass
S4,Pass,Pass,Pass,NA
S5,Pass,Fail,Pass,NA
Expected output:
Subject,Status,Count
Maths,Pass,5
Science,Pass,2
Science,Fail,3
English,Pass,4
English,NA,1
Sports,Pass,3
Sports,NA,2
You can do this with pandas. It's not exactly the same output format as in the question, but it definitely contains the same information:
import pandas as pd
# reading csv
df = pd.read_csv("input.csv")
# turning columns into rows
melt_df = pd.melt(df, id_vars=['Name'], value_vars=['Maths', 'Science', "English", "sports"], var_name="Subject", value_name="Status")
# filling NaN values, otherwise the below groupby will ignore them.
melt_df = melt_df.fillna("Unknown")
# counting per group of subject and status.
result_df = melt_df.groupby(["Subject", "Status"]).size().reset_index(name="Count")
Then you get the following result:
Subject Status Count
0 English Pass 4
1 English Unknown 1
2 Maths Pass 5
3 Science Fail 3
4 Science Pass 2
5 sports Pass 3
6 sports Unknown 2
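
Since the end goal is a bar chart, a minimal plotting sketch on top of result_df (assuming matplotlib is installed) could be:

import matplotlib.pyplot as plt

# one group of bars per subject, one bar per status
plot_df = result_df.pivot(index="Subject", columns="Status", values="Count")
plot_df.plot(kind="bar")
plt.tight_layout()
plt.show()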
PS: Going forward, always include the code you've tried so far.
To match exactly your output, this is what you could do:
import pandas as pd

df = pd.read_csv('c:/temp/data.csv', keep_default_na=False)  # or wherever your csv file is; keep 'NA' as a literal string
subjects = ['Maths', 'Science', 'English', 'sports']  # or you could get that as df.columns and drop 'Name'
grouped_rows = []
for eachsub in subjects:
    rows = df.groupby(eachsub)['Name'].count()
    idx = list(rows.index)
    if 'Pass' in idx:
        grouped_rows.append([eachsub, 'Pass', rows['Pass']])
    if 'Fail' in idx:
        grouped_rows.append([eachsub, 'Fail', rows['Fail']])
    if 'NA' in idx:
        grouped_rows.append([eachsub, 'NA', rows['NA']])
new_df = pd.DataFrame(grouped_rows, columns=['Subject', 'Status', 'Count'])
print(new_df)
I must suggest, though, that I would avoid the for loop. My approach would be just these two lines:
subjects = ['Maths', 'Science', 'English', 'sports']
grouped_rows = {sub: df.groupby(sub)['Name'].count() for sub in subjects}
Depending on your application, the per-subject counts are then already available in grouped_rows.
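For instance, a hypothetical usage of the dict form above:

print(grouped_rows['Science'])
# Science
# Fail    3
# Pass    2
# Name: Name, dtype: int64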

How to check if string elements of lists are in dataframe/other list (python)

I have the following problem. I have two lists/dataframes. One is a list/dataframe of customers, where every row is a customer and the columns are synonyms for that customer, i.e. other verbal expressions.
customer_list = {'A': ['AA', 'AA', 'AAA'], 'B': ['B', 'BB','BBB'], 'C': ['C','CC','CCC']}
customer_df = pd.DataFrame.from_dict(customer_list, orient='index')
Then I have another dataframe with the following structure:
text = [['A', 'Hello i am AA', 'Hello i am BB', 'Hello i am A'], ['B', 'Hello i am B', 'Hello i am BBB','Hello i am BB'], ['C', 'Hello i am AAA','Hello i am CC','Hello i am CCC']]
text_df = pd.DataFrame(text)
text_df = text_df.set_index(0)
text_df = text_df.rename_axis("customer")
How (which types, which functions) can I check every row of text_df (e.g. every element of row "A") for "wrong entries", meaning the elements/synonyms of other customers (so check every entry except the customer's own)? Do I have to create multiple dataframes in a for loop? Is one loop enough?
Thanks for any advice, even just a hint concerning methods.
For my example, a result like
Wrong texts: A: Hello i am BB, C: Hello i am AAA
or the corresponding indices would be great.
First, I would pd.melt to transform this DataFrame into an "index" of (customer, column, value) triples, like so:
df = pd.melt(text_df.reset_index(), id_vars="customer", var_name="columns")
Now, we have a way of "efficiently" operating over the entire data without needing to figure out the "right" columns and the like. So let's solve the "correctness" problem.
def correctness(melted_row: pd.Series, customer_df: pd.DataFrame) -> bool:
    customer = customer_df.loc[melted_row.customer]
    cust_ids = customer.values.tolist()
    return any([melted_row.value.endswith(cust_id) for cust_id in cust_ids])
Note: You could swap out .endswith with a variety of str functions to match your needs. Take a look at the docs, here.
Lastly, you can generate a mask by using the apply method across rows, like so:
df["correct"] = df.apply(correctness, axis=1, args=(customer_df, ))
You'll then have an output that looks like this:
customer columns value correct
0 A 1 Hello i am AA True
1 B 1 Hello i am B True
2 C 1 Hello i am AAA False
3 A 2 Hello i am BB False
4 B 2 Hello i am BBB True
5 C 2 Hello i am CC True
6 A 3 Hello i am A False
7 B 3 Hello i am BB True
8 C 3 Hello i am CCC True
I imagine you have other things you want to do before "un-melting" your data, so I'll point you to this SO question on how to "un-melt" it.
By "efficient", I really mean that you leverage pandas built-in functions, not that it's computationally efficient. My memory is foggy on this, but using .apply(...) is generally a last resort. I imagine there are multiple ways to crack this problem with built-ins, but I find this solution the most readable.

Pandas df loop + merge

Hello guys, I need your wisdom.
I'm still new to Python and pandas, and I'm looking to achieve the following.
df = pd.DataFrame({'code': [125, 265, 128,368,4682,12,26,12,36,46,1,2,1,3,6], 'parent': [12,26,12,36,46,1,2,1,3,6,'a','b','a','c','f'], 'name':['unknow','unknow','unknow','unknow','unknow','unknow','unknow','unknow','unknow','unknow','g1','g2','g1','g3','g6']})
ds = pd.DataFrame({'code': [125, 265, 128,368,4682], 'name': ['Eagle','Cat','Koala','Panther','Dophin']})
I would like to add a new column to the ds dataframe with the name of the highest parent.
As an example, for the first row:
code | name | category
125 | Eagle | a
"a" is the result of a loop between df.code and df.parent 125 > 12 > 1 > a
Since the last parent is not a number but a letter i think I must use a regex and than .merge from pandas to populate the ds['category'] column. Also maybe use an apply function but it seems a little bit above my current knowledge.
Could anyone help me with this?
Regards,
The following is certainly not the fastest solution, but it works if your dataframes are not too big. First create a dictionary from the parent codes in df, and then apply this dict repeatedly until you reach the end of the chain.
p = df[['code', 'parent']].set_index('code').to_dict()['parent']

def get_parent(code):
    # follow code -> parent until the current code has no parent entry
    while par := p.get(code):
        code = par
    return code

ds['category'] = ds.code.apply(get_parent)
Result:
code name category
0 125 Eagle a
1 265 Cat b
2 128 Koala a
3 368 Panther c
4 4682 Dophin f
PS: get_parent uses an assignment expression (Python >= 3.8); for older versions of Python you could use:
def get_parent(code):
    while True:
        par = p.get(code)
        if par:
            code = par
        else:
            return code
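
One caveat worth hedging: both versions loop forever if the parent data ever contains a cycle. A defensive variant (hypothetical; the sample data has no cycles) could track visited codes:

def get_parent_safe(code):
    seen = set()
    while True:
        par = p.get(code)
        if par is None:
            return code
        if code in seen:
            raise ValueError('cycle detected at code %r' % code)
        seen.add(code)
        code = par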

Count match in 2 pandas dataframes

I have 2 dataframes, each containing text as a list in each row. This one is called df:
Datum File File_type Text
Datum
2000-01-27 2000-01-27 0864820040_000127_04.txt _04 [business, date, jan, heineken, starts, integr..
and I have another one, df_lm, which looks like this:
List_type Words
0 LM_cnstrain. [abide, abiding, bound, bounded, commit, commi...
1 LM_litigius. [abovementioned, abrogate, abrogated, abrogate...
2 LM_modal_me. [can, frequently, generally, likely, often, ou...
3 LM_modal_st. [always, best, clearly, definitely, definitive...
4 LM_modal_wk. [almost, apparently, appeared, appearing, appe...
I want to create new columns in df where the matching words are counted, so for example how many of the words from df_lm.Words[0] appear in df.Text[0].
Note: df has ca. 500 rows and df_lm has 6, so I need to create 6 new columns in df, so that the updated df looks somewhat like this:
Datum ...LM_cnstrain LM_litigius Lm_modal_me ...
2000-01-27 ... 5 3 4
2000-02-25 ... 7 1 0
I hope I was clear on my question.
Thanks in advance!
EDIT:
I have already done something similar by creating a list and looping over it, but as the lists in df_lm are very long, this is not an option.
The code looked like this:
result_list = []
for file in file_list:
    count_growth = 0
    for word in text.split():
        if word in growth:
            count_growth = count_growth + 1
    a = {'Growth': count_growth}
    result_list.append(a)
According to my comments, you can try something like this. The code below has to run in a loop where the text column from the first df is matched against each of the 6 word lists from the second, creating a column with the value of len(result):
desc = df_lm.iloc[0, 1]
matches = df['Text'].isin(desc)
result = df['Text'][matches]
If this helps you, let me know; otherwise I will update/delete the answer.
So I've come to the following solution:
result_list = []
for file in file_list:
    count_lm_constraint = 0
    count_lm_litigious = 0
    count_lm_modal_me = 0
    for word in text.split():
        if word in df_lm.iloc[0, 1]:
            count_lm_constraint = count_lm_constraint + 1
        if word in df_lm.iloc[1, 1]:
            count_lm_litigious = count_lm_litigious + 1
        if word in df_lm.iloc[2, 1]:
            count_lm_modal_me = count_lm_modal_me + 1
    a = {"File": name, "Text": text,
         'lm_constraint': count_lm_constraint,
         'lm_litigious': count_lm_litigious,
         'lm_modal_me': count_lm_modal_me}  # ... and so on for the other word lists
    result_list.append(a)
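
Since looping word-by-word over ca. 500 rows and 6 long lists gets slow, a vectorized sketch of the same idea (assuming df['Text'] holds tokenized word lists and df_lm['Words'] holds the six word lists) could be:

# build one set per word list; set intersection counts *distinct* matches,
# use collections.Counter instead if repeated occurrences should count
word_sets = {row.List_type: set(row.Words) for row in df_lm.itertuples()}
for list_type, words in word_sets.items():
    df[list_type] = df['Text'].apply(lambda tokens: len(set(tokens) & words))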
