I have a .csv file with the following data:
Roll,Subject,Marks,Pass_Fail
1,A,50,P
1,B,50,P
1,C,30,F
1,D,50,P
2,A,40,P
2,B,30,F
2,C,30,F
2,D,50,P
3,A,50,P
3,B,30,F
3,C,40,P
3,D,20,F
4,A,50,P
4,B,50,P
4,C,50,P
4,D,50,P
Now, I would like to check if any person has failed B and also failed C or D.
Output -
2,B,30,F
2,C,30,F
3,B,30,F
3,D,20,F
I am new to Python. I have used Pandas, but I am only able to get the unique Roll values.
My code is as below:
import pandas as pd

dataFrame = pd.read_csv("./students.csv")
unique_rolls = dataFrame['Roll'].unique()
for roll in unique_rolls:
    # rows belonging to this roll number
    rows = dataFrame[dataFrame['Roll'] == roll]
    if (rows['Pass_Fail'] == 'F').any():
        print(rows[rows['Pass_Fail'] == 'F'])
"to check if any person has failed in both B & C or D"
Use dataframe filtering on specific conditions:
failed = df[df['Subject_Code'].isin(['B','C','D']) & df['Pass_Fail'].eq('F')]
print(failed)
Roll Subject_Code Marks Pass_Fail
6 2 C 20 F
7 2 D 25 F
import pandas as pd

df = pd.read_csv("./students.csv")
# rolls that failed subject B
failed_b = set(df.loc[df['Subject'].eq('B') & df['Pass_Fail'].eq('F'), 'Roll'])
# rolls that failed subject C or D
failed_cd = set(df.loc[df['Subject'].isin(['C', 'D']) & df['Pass_Fail'].eq('F'), 'Roll'])
# keep the failed rows of the rolls that failed B and also C or D
rolls = failed_b & failed_cd
result = df[df['Roll'].isin(rolls) & df['Pass_Fail'].eq('F')]
print(result)
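Running this on the sample CSV prints the four failed rows for rolls 2 and 3, matching the expected output above.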
I have to work on a flat file (size > 500 MB) and I need to split it into files based on one criterion.
My original file has this structure (simplified):
JournalCode|JournalLib|EcritureNum|EcritureDate|CompteNum|
I need to create two files depending on the first digit of 'CompteNum'.
I have started my code as follows:
import sys
import pandas as pd
import numpy as np
import datetime

C_FILE_SEP = "|"

def main(fic):
    pd.options.display.float_format = '{:,.2f}'.format
    FileFec = pd.read_csv(fic, sep=C_FILE_SEP, encoding='unicode_escape')
It seems OK; my concern is creating my 2 files based on the criterion. I have tried without success.
TargetFec = 'Target_' + fic + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + '.txt'
target = open(TargetFec, 'w')
FileFec = FileFec.astype(convert_dict)  # convert_dict is defined earlier (not shown)
for row in FileFec.iterrows():
    Fec_Cpt = str(FileFec['CompteNum'])
    nb = len(Fec_Cpt)
    if (nb > 7):
        target.write(str(row))
target.close()
The result in my target file is not what I expected:
(0, JournalCode OUVERT
JournalLib JOURNAL D'OUVERTURE
EcritureNum XXXXXXXXXX
EcritureDate 20190101
CompteNum 101300
CompteLib CAPITAL SOUSCRIT
CompAuxNum
CompAuxLib
PieceRef XXXXXXXXXX
PieceDate 20190101
EcritureLib A NOUVEAU
Debit 000000000000,00
Credit 000038188458,00
EcritureLet NaN
DateLet NaN
ValidDate 20190101
Montantdevise
Idevise
CodeEtbt 100
Unnamed: 19 NaN
And I expected to obtain a line in my target file only when the first digit of CompteNum is greater than 7.
I have been reading posts for 2 days; some help would be perfect.
There is a sample of my data available here
Philippe
Following the rules and the desired format, you can use logic like:
# criterion: the first digit is 8 or 9
verify = df['CompteNum'].apply(lambda number: str(number)[0] in ('8', '9'))
# saving the dataframe:
df[verify].to_csv('c:/users/jack/desktop/meets-criteria.csv', sep='|', index=False)
Original comment:
As I understand it, you want to filter the imported dataframe according to some criterion. You can work directly on the DataFrame you imported. Look:
# criterion:
verify = df['CompteNum'].apply(lambda number: len(str(number)) > 7)
# filtering the dataframe based on the given criterion:
df[verify]   # meets the criterion
df[~verify]  # does not meet the criterion
# saving the dataframes:
df[verify].to_csv('<your path>/meets-criteria.csv')
df[~verify].to_csv('<your path>/not-meets-criteria.csv')
Once you have the filtered dataframes, you can save them or convert them to other objects, such as dictionaries.
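For instance, to turn the filtered rows into a list of dictionaries (one per row), to_dict does the job:
records = df[verify].to_dict('records')  # one dict per row, keyed by column name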
I am new to data science; your help is appreciated. My question is about grouping a dataframe based on columns, so that a bar chart can be plotted showing the status counts for each subject.
My csv file is something like this:
Name,Maths,Science,English,sports
S1,Pass,Fail,Pass,Pass
S2,Pass,Pass,NA,Pass
S3,Pass,Fail,Pass,Pass
S4,Pass,Pass,Pass,NA
S5,Pass,Fail,Pass,NA
Expected output:
Subject,Status,Count
Maths,Pass,5
Science,Pass,2
Science,Fail,3
English,Pass,4
English,NA,1
Sports,Pass,3
Sports,NA,2
You can do this with pandas, not exactly in the same output format as in the question, but definitely with the same information:
import pandas as pd
# reading csv
df = pd.read_csv("input.csv")
# turning columns into rows
melt_df = pd.melt(df, id_vars=['Name'], value_vars=['Maths', 'Science', "English", "sports"], var_name="Subject", value_name="Status")
# filling NaN values, otherwise the below groupby will ignore them.
melt_df = melt_df.fillna("Unknown")
# counting per group of subject and status.
result_df = melt_df.groupby(["Subject", "Status"]).size().reset_index(name="Count")
Then you get the following result:
   Subject   Status  Count
0  English     Pass      4
1  English  Unknown      1
2    Maths     Pass      5
3  Science     Fail      3
4  Science     Pass      2
5   sports     Pass      3
6   sports  Unknown      2
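Since the end goal was a bar chart, here is a minimal plotting sketch building on result_df (assuming matplotlib is installed):
import matplotlib.pyplot as plt

# one group of bars per subject, one bar per status
plot_df = result_df.pivot(index="Subject", columns="Status", values="Count").fillna(0)
plot_df.plot(kind="bar")
plt.ylabel("Count")
plt.tight_layout()
plt.show()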
PS: Going forward, always include the code you have tried so far.
To match your output exactly, this is what you could do:
import pandas as pd

df = pd.read_csv('c:/temp/data.csv')  # or wherever your csv file is
df = df.fillna('NA')  # read_csv parses the literal 'NA' entries as NaN; restore them
subjects = ['Maths', 'Science', 'English', 'sports']  # or take df.columns and drop 'Name'
grouped_rows = []
for eachsub in subjects:
    rows = df.groupby(eachsub)['Name'].count()
    idx = list(rows.index)
    if 'Pass' in idx:
        grouped_rows.append([eachsub, 'Pass', rows['Pass']])
    if 'Fail' in idx:
        grouped_rows.append([eachsub, 'Fail', rows['Fail']])
    if 'NA' in idx:
        grouped_rows.append([eachsub, 'NA', rows['NA']])
new_df = pd.DataFrame(grouped_rows, columns=['Subject', 'Status', 'Count'])
print(new_df)
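With the sample CSV above, this prints (matching the expected output):
   Subject Status  Count
0    Maths   Pass      5
1  Science   Pass      2
2  Science   Fail      3
3  English   Pass      4
4  English     NA      1
5   sports   Pass      3
6   sports     NA      2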
I must suggest, though, that I would avoid the for loop. My approach would be just these two lines:
subjects = ['Maths', 'Science', 'English', 'sports']
grouped_rows = {sub: df.groupby(sub)['Name'].count() for sub in subjects}
Depending on your application, the data is then already available in grouped_rows.
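For example, with the sample data (after the fillna('NA') step above):
>>> grouped_rows['Science']
Science
Fail    3
Pass    2
Name: Name, dtype: int64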
I have written code that merges File B into File A based on a column 'Code'. Some of the values coming from File B, however, are generic ('Color'), and for those rows I need another merge with File C. Instead of creating a new column, I would like to reuse the column created during the first merge: wherever the first merge returned the value 'Color', and only for those rows, merge with File C to get the proper value.
I went as far as merging A with B:
import pandas as pd
File_A = pd.read_excel(r'.../My Files/Python/Supplier cat testing/File A.xlsx')
File_B = pd.read_excel(r'.../My Files/Python/Supplier cat testing/File B.xlsx')
File_C = pd.read_excel(r'.../My Files/Python/Supplier cat testing/File C.xlsx')
results = pd.merge(File_A, File_B[['Code','Color']], on='Code')
results.to_excel('Output_File.xlsx', index=False)
Would anyone have any idea where I should even start, please?
Try:
dfOut = dfB.copy()
merged = dfB.merge(dfC, on='Code')
# keep the color from the first merge unless it is the generic 'Color', otherwise take File C's value
dfOut['Color'] = merged.apply(lambda r: r.Color_y if r.Color_x == 'Color' else r.Color_x, axis=1)
print(dfOut)
Output
   Code   Color
0     0   Green
1     1  Yellow
2     2  Orange
3     3     Red
4     4   Black
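Translated to the variable names in the question (a sketch, assuming File_C also has 'Code' and 'Color' columns):
results = pd.merge(File_A, File_B[['Code', 'Color']], on='Code')
merged = results.merge(File_C[['Code', 'Color']], on='Code', how='left', suffixes=('_ab', '_c'))
# replace the generic 'Color' placeholder with the value from File C
merged['Color'] = merged['Color_ab'].mask(merged['Color_ab'].eq('Color'), merged['Color_c'])
results = merged.drop(columns=['Color_ab', 'Color_c'])
results.to_excel('Output_File.xlsx', index=False)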
I have dataframes in my local namespace named like s1_down_threshold, s1_up_threshold, s2_down_threshold, s2_up_threshold, s19_down_threshold, s19_up_threshold, and so on.
I would like to sort the dataframes having 'down_threshold' in their names in descending order based on one column, and the dataframes having 'up_threshold' in their names in ascending order based on the same column.
I know that I can use .sort_values() on each and every one of them, but is there a more efficient way to do it?
I was hoping for something like the following: go through the names of all the dataframes in my local namespace, find those with 'down_threshold' in their names and sort them accordingly, then repeat the process for 'up_threshold'.
Edit 1:
You can name each data frame before adding it to a list, as below:
import pandas as pd

# using sample data
data = {'id': ['A', 'B', 'C', 'D'], 'value': [2000, 600, 400, 3000]}
df = pd.DataFrame(data)
df1 = df.copy()
df2 = df.copy()
df3 = df.copy()
df4 = df.copy()

DataFrameList = []
df1.name = 's1_down_threshold'
DataFrameList.append(df1)
df2.name = 's2_down_threshold'
DataFrameList.append(df2)
df3.name = 's1_up_threshold'
DataFrameList.append(df3)
df4.name = 's2_up_threshold'
DataFrameList.append(df4)

for frame in DataFrameList:
    if 'down' in frame.name:
        print(frame.name, 'sorted down')
        frame.sort_values(by='value', ascending=False, inplace=True)
    elif 'up' in frame.name:
        print(frame.name, 'sorted up')
        frame.sort_values(by='value', ascending=True, inplace=True)
>>> DataFrameList
[ id value
3 D 3000
0 A 2000
1 B 600
2 C 400,
id value
3 D 3000
0 A 2000
1 B 600
2 C 400,
id value
2 C 400
1 B 600
0 A 2000
3 D 3000,
id value
2 C 400
1 B 600
0 A 2000
3 D 3000]
If all your dataframes are saved as csv files in the same folder, you can use the os library to load all of them and use split on the filename to decide whether to sort in ascending or descending order.
Here is what it could look like:
import os
import pandas as pd

for file in os.listdir('./folder1/'):
    df = pd.read_csv('folder1/' + file)
    if file.split('.')[0].split('_')[1] == 'down':
        df.sort_values(by='value', ascending=False, inplace=True)  # 'value' = your sort column
    elif file.split('.')[0].split('_')[1] == 'up':
        df.sort_values(by='value', ascending=True, inplace=True)
    df.to_csv('folder1/' + file, index=False)
If there are other files in that directory, move the read_csv and to_csv calls inside the if and elif conditions so that only the threshold files are rewritten.
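If the dataframes live in memory rather than on disk, a dict keyed by name gives the same effect (a sketch; the names and the 'value' sort column follow the sample data above):
frames = {'s1_down_threshold': df1, 's1_up_threshold': df3}
for name, frame in frames.items():
    # 'up' names sort ascending, 'down' names descending
    frame.sort_values(by='value', ascending='up_threshold' in name, inplace=True)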
pandas provides a useful to_html() method to convert a DataFrame into an HTML table. Is there any useful function to read it back into a DataFrame?
The read_html utility was released in pandas 0.12.
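For example (read_html returns a list of DataFrames, one per table found in the HTML):
import pandas as pd

html = df.to_html()
round_trip = pd.read_html(html, index_col=0)[0]  # first (and only) table in the string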
In the general case it is not possible, but if you approximately know the structure of your table you could do something like this:
# Create a test df:
>>> import numpy as np
>>> from pandas import DataFrame
>>> df = DataFrame(np.random.rand(4, 5), columns=list('abcde'))
>>> df
          a         b         c         d         e
0  0.675006  0.230464  0.386991  0.422778  0.657711
1  0.250519  0.184570  0.470301  0.811388  0.762004
2  0.363777  0.715686  0.272506  0.124069  0.045023
3  0.657702  0.783069  0.473232  0.592722  0.855030
Now parse the html and reconstruct:
from pyquery import PyQuery as pq
d = pq(df.to_html())
columns = d('thead tr').eq(0).text().split()
n_rows = len(d('tbody tr'))
values = np.array(d('tbody tr td').text().split(), dtype=float).reshape(n_rows, len(columns))
>>> DataFrame(values, columns=columns)
          a         b         c         d         e
0  0.675006  0.230464  0.386991  0.422778  0.657711
1  0.250519  0.184570  0.470301  0.811388  0.762004
2  0.363777  0.715686  0.272506  0.124069  0.045023
3  0.657702  0.783069  0.473232  0.592722  0.855030
You could extend this for MultiIndex DataFrames, or add automatic type detection using eval() if needed.
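A crude sketch of the eval() idea (hypothetical helper; it literal-evaluates each cell and falls back to the raw string):
def coerce(cell):
    try:
        return eval(cell, {}, {})  # '0.5' -> 0.5, 'True' -> True, ...
    except Exception:
        return cell  # leave non-literal strings as-is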