Explode a cell populated with multiple values into unique rows - python

I want to "explode" each cell that has multiple words in it into distinct rows while retaining it's rating and sysnet value when being conjoined. I attempted to import someone's pandas_explode library but VS code just does not want to recognize it. Is there any way for me in pandas documentation or some nifty for loop that'll extract and redistribute these words? Example csv is in the img link
import json
import pandas as pd # version 1.01
df = pd.read_json('result.json')
df.to_csv('jsonToCSV.csv', index=False)
df = pd.read_csv('jsonToCSV.csv')
df = df.explode('words')
print(df)
df = df.to_csv(r'C:\Users\alant\Desktop\test.csv', index = None, header=True)
Output when running above:
synset rating words
0 1034312 0.0 ['discourse', 'talk about', 'discuss']
1 146856 0.0 ['merging', 'meeting', 'coming together']
2 829378 0.0 ['care', 'charge', 'tutelage', 'guardianship']
3 8164585 0.0 ['administration', 'governance', 'governing bo...
4 1204318 0.0 ['nonhierarchical', 'nonhierarchic']
... ... ... ...
8605 7324673 1.0 ['emergence', 'outgrowth', 'growth']

If you have columns that need to be kept from exploding, I suggest setting them as the index first and then exploding.
For your example, see if this works for you:
df = df.set_index(['synset','rating']).apply(pd.Series.explode) # this would work for exploding multiple columns as well
# then reset the index
df = df.reset_index()
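Note that in your snippet the round trip through jsonToCSV.csv turns the words lists into plain strings, so explode has nothing to split. A minimal sketch of one way to handle that, assuming the column holds string representations of lists:
import ast
import pandas as pd

df = pd.read_csv('jsonToCSV.csv')
# after read_csv, 'words' holds strings like "['discourse', 'talk about', 'discuss']";
# parse them back into real lists before exploding
df['words'] = df['words'].apply(ast.literal_eval)
df = df.set_index(['synset', 'rating']).apply(pd.Series.explode).reset_index()
Alternatively, skip the CSV round trip entirely and explode the dataframe returned by read_json directly.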

Related

Group dataframe based on columns

I am new to data science; your help is appreciated. My question is about grouping a dataframe based on columns, so that a bar chart can be plotted showing the pass/fail status counts for each subject.
My csv file is something like this:
Name,Maths,Science,English,sports
S1,Pass,Fail,Pass,Pass
S2,Pass,Pass,NA,Pass
S3,Pass,Fail,Pass,Pass
S4,Pass,Pass,Pass,NA
S5,Pass,Fail,Pass,NA
Expected output:
Subject,Status,Count
Maths,Pass,5
Science,Pass,2
Science,Fail,3
English,Pass,4
English,NA,1
Sports,Pass,3
Sports,NA,2
You can do this with pandas, not exactly in the same output format as in the question, but definitely with the same information:
import pandas as pd
# reading csv
df = pd.read_csv("input.csv")
# turning columns into rows
melt_df = pd.melt(df, id_vars=['Name'], value_vars=['Maths', 'Science', "English", "sports"], var_name="Subject", value_name="Status")
# filling NaN values, otherwise the below groupby will ignore them.
melt_df = melt_df.fillna("Unknown")
# counting per group of subject and status.
result_df = melt_df.groupby(["Subject", "Status"]).size().reset_index(name="Count")
Then you get the following result:
Subject Status Count
0 English Pass 4
1 English Unknown 1
2 Maths Pass 5
3 Science Fail 3
4 Science Pass 2
5 sports Pass 3
6 sports Unknown 2
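Since the stated goal is a bar chart per subject and status, here is a quick sketch of one way to plot result_df with pandas' built-in matplotlib plotting (the exact styling is just illustrative):
import matplotlib.pyplot as plt

# pivot so subjects sit on the x-axis and each status becomes its own bar
pivot_df = result_df.pivot(index='Subject', columns='Status', values='Count').fillna(0)
pivot_df.plot.bar()
plt.ylabel('Count')
plt.show()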
PS: going forward, always paste the code you have tried so far.
To match your output exactly, this is what you could do:
import pandas as pd
df = pd.read_csv('c:/temp/data.csv')  # or wherever your csv file is
subjects = ['Maths', 'Science', 'English', 'sports']  # or you could get that as df.columns and drop 'Name'
grouped_rows = []
for eachsub in subjects:
    rows = df.groupby(eachsub)['Name'].count()
    idx = list(rows.index)
    if 'Pass' in idx:
        grouped_rows.append([eachsub, 'Pass', rows['Pass']])
    if 'Fail' in idx:
        grouped_rows.append([eachsub, 'Fail', rows['Fail']])
new_df = pd.DataFrame(grouped_rows, columns=['Subject', 'Grade', 'Count'])
print(new_df)
I must suggest, though, that I would avoid the for loop. My approach would be just these two lines:
subjects = ['Maths', 'Science', 'English', 'sports']
grouped_rows = {sub: df.groupby(sub)['Name'].count() for sub in subjects}
Depending on your application, you already have the data available in grouped_rows.

Import multiple excel files, create a column and get values from excel file's name

I need to upload multiple excel files - each one is named with its starting date, e.g. "20190114".
Then I need to append them into one DataFrame.
For this, I use the following code:
import glob
import pandas as pd

all_data = pd.DataFrame()
for f in glob.glob('C:\\path\\*.xlsx'):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
In fact, I do not need all the data, but only rows filtered by multiple columns.
Then, I would like to create an additional column ('from') holding the file name (which is a date) for each respective file.
Example:
Data from the excel file, named '20190101'
Data from the excel file, named '20190115'
The final dataframe must keep only rows whose 'price' column is not equal to 0 and whose 'code' column equals 'r' (I do not know if it's possible to import the data already filtered, avoiding loading a huge volume of data?), and then I need to add a column 'from' with the respective date coming from the file's name, like this:
dataframes for trial:
import pandas as pd
df1 = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5'],
                    'price': [0, 12.5, 17.5, 24.5, 7.5],
                    'code': ['r', 'r', 'r', 'c', 'r']})
df2 = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5'],
                    'price': [7.5, 24.5, 0, 149.5, 7.5],
                    'code': ['r', 'r', 'r', 'c', 'r']})
IIUC, you can filter the necessary rows, then concat; for the file name you can use os.path.split() and access the date part with string slicing:
import os
import glob
import pandas as pd

l = []
for f in glob.glob('C:\\path\\*.xlsx'):
    df = pd.read_excel(f)
    df['from'] = os.path.split(f)[1][:-5]  # file name without the '.xlsx' extension
    l.append(df[df['code'].eq('r') & df['price'].ne(0)])
pd.concat(l, ignore_index=True)
id price code from
0 id_2 12.5 r 20190101
1 id_3 17.5 r 20190101
2 id_5 7.5 r 20190101
3 id_1 7.5 r 20190115
4 id_2 24.5 r 20190115
5 id_5 7.5 r 20190115
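On the side question about avoiding the import of a huge volume of data: read_excel cannot filter rows while reading, but it can restrict which columns are loaded via usecols. A small sketch, assuming these are the only columns you need:
df = pd.read_excel(f, usecols=['id', 'price', 'code'])
The row filter on code and price still has to happen after the read, as in the loop above.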

Replace the value inside a csv column by the value inside parentheses of the same column using python pandas

I got the following csv file with sample data:
Now I want to replace the values of the 'SIFT' and 'PolyPhen' columns with the data inside the parentheses of those columns. So for row 1 the SIFT value will be replaced with 0.82, and for row 2 the SIFT value will be 0.85. I also want the part before the parentheses, tolerated/deleterious, in a new column named 'SIFT_prediction'.
This is what I tried so far:
import pandas as pd
import re
testfile = 'test_sift_columns.csv'
df = pd.read_csv(testfile)
df['SIFT'].re.search(r'\((.*?)\)',s).group(1)
This code was meant to take everything inside the parentheses of the SIFT column, but it does not replace anything. I probably need a for loop to read and replace every row, but I don't know how to do it correctly. I am also not sure whether a regular expression is necessary with pandas. Maybe there is a smarter way to solve my problem.
Use Series.str.extract:
df = pd.DataFrame({'SIFT': ['tol(0.82)', 'tol(0.85)', 'tol(1.42)'],
                   'PolyPhen': ['beg(0)', 'beg(0)', 'beg(0)']})
pat = r'(.*?)\((.*?)\)'
df[['SIFT_prediction','SIFT']] = df['SIFT'].str.extract(pat)
df[['PolyPhen_prediction','PolyPhen']] = df['PolyPhen'].str.extract(pat)
print(df)
SIFT_prediction SIFT PolyPhen_prediction PolyPhen
0 tol 0.82 beg 0
1 tol 0.85 beg 0
2 tol 1.42 beg 0
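Note that str.extract returns the captured groups as strings; if you need the numbers as floats for later computation, convert them afterwards:
# extracted values are strings; convert them to a numeric dtype
df['SIFT'] = pd.to_numeric(df['SIFT'])
df['PolyPhen'] = pd.to_numeric(df['PolyPhen'])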
Alternative:
df[['SIFT_prediction','SIFT']] = df['SIFT'].str.rstrip(')').str.split('(', expand=True)
df[['PolyPhen_prediction','PolyPhen']] = df['PolyPhen'].str.rstrip(')').str.split('(', expand=True)
You can do something like replacing all non-numeric characters with empty strings in order to get the float value, and the opposite to get the prediction.
import pandas as pd
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'SIFT': ['tolerated(0.82)', 'tolerated(0.85)', 'tolerated(0.25)', 'dedicated(0.5)']})
df['SIFT_formatted'] = df.SIFT.str.replace('[^0-9.]', '', regex=True).astype(float)
df['SIFT_prediction'] = df.SIFT.str.replace('[^a-zA-Z]', '', regex=True)
df
Would give you:
ID SIFT SIFT_formatted SIFT_prediction
0 1 tolerated(0.82) 0.82 tolerated
1 2 tolerated(0.85) 0.85 tolerated
2 3 tolerated(0.25) 0.25 tolerated
3 4 dedicated(0.5) 0.50 dedicated

Replace cell values in each row of pandas column using binarizer and for loop

I need some help here. I'm trying to change one column in my .csv file, where some cells are empty and some contain a list of categories, as follows:
tdaa_matParent,tdaa_matParentQty
[],[]
[],[]
[],[]
[BCA_Aluminum],[1.3458]
[BCA_Aluminum],[1.3458]
[BCA_Aluminum],[1.3458]
[BCA_Aluminum],[1.3458]
[],[]
[Dye Penetrant Solution, BCA_Aluminum],[0.002118882, 1.3458]
So far I have only managed to binarize the first column (tdaa_matParent), but I am not able to replace the 1s with their corresponding quantity values, like this:
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

s = materials['tdaa_matParent']
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_)
BCA_Aluminum,Dye Penetrant Solution,tdaa_matParentQty
0,0,[]
0,0,[]
0,0,[]
1,0,[1.3458,0]
1,0,[1.3458,0]
1,0,[1.3458,0]
1,0,[1.3458,0]
0,0,[]
1,1,[1.3458,0.002118882]
But what I really want is a new set of columns, one per category (i.e. BCA_Aluminum and Dye Penetrant Solution). Also, each filled cell should be replaced by the corresponding value from the second column (tdaa_matParentQty).
For example:
BCA_Aluminum,Dye Penetrant Solution
0,0
0,0
0,0
1.3458,0
1.3458,0
1.3458,0
1.3458,0
0,0
1.3458,0.002118882
Thanks! I built another approach that also works (a bit slower, though). Any suggestions, feel free to share :)
df_matParent_with_Qty = pd.DataFrame()
# for each row in the dataframe (index and row's column info),
for index, row in ass_materials.iterrows():
    # for each row iteration, save the name of the element (matParent) and its index number:
    for i, element in enumerate(row["tdaa_matParent"]):
        # fill the empty dataframe with one column per element, and in each
        # corresponding row, set the value at the same index inside the matParentQty list
        df_matParent_with_Qty.loc[index, element] = row['tdaa_matParentQty'][i]
df_matParent_with_Qty.head(10)
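For what it's worth, a vectorized sketch of the same idea without iterrows, assuming both columns already contain real Python lists of equal per-row length (names as in the snippet above; lists parsed from text may need ast.literal_eval first):
# explode both list columns in parallel; rows that came from empty lists become NaN
tmp = (ass_materials[['tdaa_matParent', 'tdaa_matParentQty']]
       .apply(pd.Series.explode)
       .dropna())
# pivot: one column per material, filled with its quantity, 0 elsewhere
df_matParent_with_Qty = (tmp.pivot_table(index=tmp.index,
                                         columns='tdaa_matParent',
                                         values='tdaa_matParentQty',
                                         aggfunc='first',
                                         fill_value=0)
                         .reindex(ass_materials.index, fill_value=0)
                         .astype(float))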
This is how I would do it with built-in Python means for the sample data provided in the question:
from collections import OrderedDict
import pandas as pd

# simple case - material names are known before we process the data,
# which allows solving the problem with a single for loop;
# OrderedDict is used to preserve the order of material names during the processing
base_result = OrderedDict([
    ('BCA_Aluminum', .0),
    ('Dye Penetrant Solution', .0)])

result = list()
with open('1.txt', mode='r', encoding='UTF-8') as file:
    # skip the header
    file.readline()
    for line in file:
        # copy base_result to reuse it during the looping
        base_result_copy = base_result.copy()
        # modify the base result only if there are values in the current line
        if line != '[],[]\n':
            names, values = line.strip('[]\n').split('],[')
            for name, value in zip(names.split(', '), values.split(', ')):
                base_result_copy[name] = float(value)
        # append the new row (base or modified) to the result
        result.append(list(base_result_copy.values()))

# turn the list of lists into a pandas dataframe
result = pd.DataFrame(result, columns=base_result.keys())
print(result)
Output:
BCA_Aluminum Dye Penetrant Solution
0 0.0000 0.000000
1 0.0000 0.000000
2 0.0000 0.000000
3 1.3458 0.000000
4 1.3458 0.000000
5 1.3458 0.000000
6 1.3458 0.000000
7 0.0000 0.000000
8 1.3458 0.002119
0.002119 instead of 0.002118882 is due to how pandas displays floats by default; the original precision is preserved in the actual data in the dataframe.

Join multiple CSV files by using python pandas

I am trying to create a CSV file from multiple csv files by using python pandas.
accreditation.csv :-
"pid","accreditation_body","score"
"25799","TAAC","4.5"
"25796","TAAC","5.6"
"25798","DAAC","5.7"
ref_university.csv :-
"id","pid","survery_year","end_year"
"1","25799","2018","2018"
"2","25797","2016","2018"
I want to create a new table by reading the instructions from table_structure.csv: join the two tables and rewrite accreditation.csv. The "REFERENCES ref_university(id, survey_year)" instruction connects to ref_university.csv and inserts the id and survery_year column values by matching on the pid column.
table_structure.csv :-
table_name,attribute_name,attribute_type,Description
,,,
accreditation,accreditation_body,varchar,
,grading,varchar,
,pid,int4, "REFERENCES ref_university(id, survey_year)"
,score,float8,
Modified CSV file should look like,
New accreditation.csv :-
"accreditation_body","grading","pid","id","survery_year","score"
"TAAC","","25799","1","2018","2018","4.5"
"TAAC","","25797","2","2016","2018","5.6"
"DAAC","","25798","","","","5.7"
I can read the csv in pandas:
df = pd.read_csv("accreditation.csv")
But what is the recommended way to read the REFERENCES instruction and pick up the column values? If there is no matching value, the column should be blank.
We cannot hardcode pid in the pandas call. We have to read table_structure.csv and check whether there is a REFERENCES entry, then pull in the mentioned columns. It should not be a full merge; just the specific columns should be added.
A dynamic solution is possible, but not so easy:
df = pd.read_csv("table_structure.csv")
#remove only NaNs rows
df = df.dropna(how='all')
#repalce NaNs by forward filling
df['table_name'] = df['table_name'].ffill()
#create for each table_name one row
df = (df.dropna(subset=['Description'])
.join(df.groupby('table_name')['attribute_name'].apply(list)
.rename('cols'), 'table_name'))
#get name of DataFrame and new columns names
df['df1'] = df['Description'].str.extract('REFERENCES\s*(.*)\s*\(')
df['new_cols'] = df['Description'].str.extract('\(\s*(.*)\s*\)')
df['new_cols'] = df['new_cols'].str.split(', ')
#remove unnecessary columns
df = df.drop(['attribute_type','Description'], axis=1).set_index('table_name')
print (df)
              attribute_name                                        cols  \
table_name
accreditation            pid  [accreditation_body, grading, pid, score]

                          df1           new_cols
table_name
accreditation  ref_university  [id, survey_year]
# for selecting by name, create a dictionary of DataFrames
data = {'accreditation': pd.read_csv("accreditation.csv"),
        'ref_university': pd.read_csv("ref_university.csv")}
# select by index
v = df.loc['accreditation']
print (v)
attribute_name                                          pid
cols             [accreditation_body, grading, pid, score]
df1                                          ref_university
new_cols                                  [id, survey_year]
Name: accreditation, dtype: object
Selecting by the dictionary and by the Series v:
df = pd.merge(data[v.name],
              data[v['df1']][v['new_cols'] + [v['attribute_name']]],
              on=v['attribute_name'],
              how='left')
is converted to:
df = pd.merge(data['accreditation'],
              data['ref_university'][['id', 'survey_year'] + ['pid']],
              on='pid',
              how='left')
and returns:
print (df)
     pid accreditation_body  score   id  survey_year
0  25799               TAAC    4.5  1.0       2018.0
1  25796               TAAC    5.6  NaN          NaN
2  25798               DAAC    5.7  NaN          NaN
Last, add the new columns by union and reindex:
df = df.reindex(columns=df.columns.union(v['cols']))
print (df)
  accreditation_body  grading   id    pid  score  survey_year
0               TAAC      NaN  1.0  25799    4.5       2018.0
1               TAAC      NaN  NaN  25796    5.6          NaN
2               DAAC      NaN  NaN  25798    5.7          NaN
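To run the same steps for every table listed in table_structure.csv, the merge and reindex can be wrapped in a loop. A sketch, assuming the parsed structure table is kept under its own name (struct_df here, so it is not overwritten by the merge result) and with an illustrative output file name:
# one merge per table described in table_structure.csv
for table_name, v in struct_df.iterrows():
    out = pd.merge(data[table_name],
                   data[v['df1']][v['new_cols'] + [v['attribute_name']]],
                   on=v['attribute_name'],
                   how='left')
    out = out.reindex(columns=out.columns.union(v['cols']))
    out.to_csv(table_name + '.csv', index=False)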
Here is the working code, try it! When files are huge, set low_memory=False in pd.read_csv().
import pandas as pd
import glob

# gets the path to the data folder
path = r"C:\Users\data_folder"
# reads all files with the .csv extension
filenames = glob.glob(path + "\\*.csv")
print('File names:', filenames)
df = pd.DataFrame()
# for loop to iterate over and concat the csv files
for file in filenames:
    temp = pd.read_csv(file, low_memory=False)
    df = pd.concat([df, temp], axis=1)  # set axis=0 if you want to join rows
df.to_csv('output.csv')
