I'm trying to save some data in a DataFrame. The first row of the DataFrame should be ('Tom', .99, 'tom2'). Suppose I need to add a ('mart', .3, 'mart2') row to the DataFrame. I've tried to use append, but it adds nothing. This is my code:
import pandas as pd
trackeds = {'Name':['Tom'], 'proba':[.99],'name2':['tom2']}
df_trackeds = pd.DataFrame(trackeds)
df_trackeds.append(pd.DataFrame({'name':['mart'],'proba': [.3],'name2':['mart2']}))
print(df_trackeds)
The output is:
Name proba name2
0 Tom 0.99 tom2
I also tried
df_trackeds.append({'name':['mart'],'proba': [.3],'name2':['mart2']},ignore_index=True)
and
df_trackeds.append(pd.DataFrame({'name':['mart'],'proba': [.3],'name2':['mart2']}))
but nothing changed. I hope you can help me; thanks in advance.
The pandas function DataFrame.append does not work in place like the pure Python list.append; it returns a new DataFrame, so it is necessary to assign it back:
df = pd.DataFrame({'Name':['mart'],'proba': [.3],'name2':['mart2']})
df_trackeds = df_trackeds.append(df, ignore_index=True)
print(df_trackeds)
Name proba name2
0 Tom 0.99 tom2
1 mart 0.30 mart2
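Worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same pattern is written with pd.concat, which likewise returns a new DataFrame that must be assigned back. A minimal sketch:

```python
import pandas as pd

df_trackeds = pd.DataFrame({'Name': ['Tom'], 'proba': [0.99], 'name2': ['tom2']})
new_row = pd.DataFrame({'Name': ['mart'], 'proba': [0.3], 'name2': ['mart2']})

# concat also returns a new DataFrame, so assign the result back
df_trackeds = pd.concat([df_trackeds, new_row], ignore_index=True)
print(df_trackeds)
```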
I am new to data science; your help is appreciated. My question is about grouping a DataFrame by columns so that a bar chart can be plotted showing the status counts for each subject.
My CSV file is something like this:
Name,Maths,Science,English,sports
S1,Pass,Fail,Pass,Pass
S2,Pass,Pass,NA,Pass
S3,Pass,Fail,Pass,Pass
S4,Pass,Pass,Pass,NA
S5,Pass,Fail,Pass,NA
Expected output:
Subject,Status,Count
Maths,Pass,5
Science,Pass,2
Science,Fail,3
English,Pass,4
English,NA,1
Sports,Pass,3
Sports,NA,2
You can do this with pandas. The result is not in exactly the same output format as in the question, but it definitely contains the same information:
import pandas as pd
# reading csv
df = pd.read_csv("input.csv")
# turning columns into rows
melt_df = pd.melt(df, id_vars=['Name'], value_vars=['Maths', 'Science', "English", "sports"], var_name="Subject", value_name="Status")
# filling NaN values, otherwise the below groupby will ignore them.
melt_df = melt_df.fillna("Unknown")
# counting per group of subject and status.
result_df = melt_df.groupby(["Subject", "Status"]).size().reset_index(name="Count")
Then you get the following result:
Subject Status Count
0 English Pass 4
1 English Unknown 1
2 Maths Pass 5
3 Science Fail 3
4 Science Pass 2
5 sports Pass 3
6 sports Unknown 2
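If you also want the rows in the original subject order rather than alphabetical, one option (a self-contained sketch that builds the frame inline instead of reading input.csv) is to turn Subject into an ordered categorical before sorting:

```python
import pandas as pd

# same sample data as the question, built inline; None plays the role of NA
df = pd.DataFrame({
    'Name': ['S1', 'S2', 'S3', 'S4', 'S5'],
    'Maths': ['Pass', 'Pass', 'Pass', 'Pass', 'Pass'],
    'Science': ['Fail', 'Pass', 'Fail', 'Pass', 'Fail'],
    'English': ['Pass', None, 'Pass', 'Pass', 'Pass'],
    'sports': ['Pass', 'Pass', 'Pass', None, None],
})

melt_df = df.melt(id_vars=['Name'], var_name='Subject', value_name='Status').fillna('Unknown')
result_df = melt_df.groupby(['Subject', 'Status']).size().reset_index(name='Count')

# sort subjects in their original column order instead of alphabetically
order = ['Maths', 'Science', 'English', 'sports']
result_df['Subject'] = pd.Categorical(result_df['Subject'], categories=order, ordered=True)
result_df = result_df.sort_values(['Subject', 'Status']).reset_index(drop=True)
print(result_df)
```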
PS: Going forward, always paste the code you've tried so far.
To match exactly your output, this is what you could do:
import pandas as pd
df = pd.read_csv('c:/temp/data.csv')  # Or wherever your csv file is
subjects = ['Maths', 'Science', 'English', 'sports']  # Or you could get these from df.columns and drop 'Name'
grouped_rows = []
for eachsub in subjects:
    rows = df.groupby(eachsub)['Name'].count()
    idx = list(rows.index)
    if 'Pass' in idx:
        grouped_rows.append([eachsub, 'Pass', rows['Pass']])
    if 'Fail' in idx:
        grouped_rows.append([eachsub, 'Fail', rows['Fail']])
new_df = pd.DataFrame(grouped_rows, columns=['Subject', 'Grade', 'Count'])
print(new_df)
I must suggest though that I would avoid the explicit for loop. My approach would be just these two lines:
subjects = ['Maths', 'Science', 'English', 'sports']
grouped_rows = {sub: df.groupby(sub)['Name'].count() for sub in subjects}
Depending on your application, you then already have the per-subject counts available in grouped_rows.
I have this dataframe data with around 10,000 records of sold items by 201 authors.
I want to add a column to this dataframe with the average price for each author.
First I create this new column average_price, and then I create another dataframe df
with 201 rows of authors and their average prices (at least I think this is the right way to do this):
data["average_price"] = 0
df = data.groupby('Author Name', as_index=False)['price'].mean()
df looks like this
Author Name price
0 Agnes Cleve 107444.444444
1 Akseli Gallen-Kallela 32100.384615
2 Albert Edelfelt 207859.302326
3 Albert Johansson 30012.000000
4 Albin Amelin 44400.000000
... ... ...
196 Waldemar Lorentzon 152730.000000
197 Wilhelm von Gegerfelt 25808.510638
198 Yrjö Edelmann 53268.928571
199 Åke Göransson 87333.333333
200 Öyvind Fahlström 351345.454545
Now I want to use this df to populate the average_price column in the larger dataframe data.
I could not come up with how to do this, so I tried a for loop, which is not working. (And I know you should avoid for loops when working with dataframes.)
for index, row in data.iterrows():
    for ind, r in df.iterrows():
        if row["Author Name"] == r["Author Name"]:
            row["average_price"] = r["price"]
So I wonder how this should be done?
You can use transform with groupby to add the new column:
data['average_price'] = data.groupby('Author Name')['price'].transform('mean')
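A quick self-contained sketch of what transform('mean') does, with made-up authors and prices — it returns a Series aligned to the original rows, so each row receives its own group's mean:

```python
import pandas as pd

data = pd.DataFrame({
    'Author Name': ['A', 'A', 'B', 'B', 'B'],
    'price': [100.0, 300.0, 50.0, 70.0, 60.0],
})

# every row of an author gets that author's mean price
data['average_price'] = data.groupby('Author Name')['price'].transform('mean')
print(data)
```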
I think based on what you described, you should use the .join method on a pandas DataFrame. You don't need to create the 'average_price' column manually. Note that join with on= matches against the other frame's index, so keep 'Author Name' as the index of the aggregated Series:
df = data.groupby('Author Name')['price'].mean().rename('average_price')
data = data.join(df, on='Author Name')
Now you can get the average price from the data['average_price'] column.
Hope this helps!
I think the easiest way to do that would be a join (via pandas.merge):
df_data = pd.DataFrame([...]) # your data here
# rename the aggregated column so it doesn't collide with the per-sale price on merge
df_agg_data = df_data.groupby('Author Name', as_index=False)['price'].mean().rename(columns={'price': 'average_price'})
df_data = df_data.merge(df_agg_data, on="Author Name")
print(df_data)
I want to "explode" each cell that has multiple words in it into distinct rows, while retaining its rating and synset value. I attempted to import someone's pandas_explode library, but VS Code just does not want to recognize it. Is there anything in the pandas documentation, or some nifty for loop, that will extract and redistribute these words? An example csv is in the img link.
import json
import pandas as pd # version 1.01
df = pd.read_json('result.json')
df.to_csv('jsonToCSV.csv', index=False)
df = pd.read_csv('jsonToCSV.csv')
df = df.explode('words')
print(df)
df = df.to_csv(r'C:\Users\alant\Desktop\test.csv', index = None, header=True)
Output when running above:
synset rating words
0 1034312 0.0 ['discourse', 'talk about', 'discuss']
1 146856 0.0 ['merging', 'meeting', 'coming together']
2 829378 0.0 ['care', 'charge', 'tutelage', 'guardianship']
3 8164585 0.0 ['administration', 'governance', 'governing bo...
4 1204318 0.0 ['nonhierarchical', 'nonhierarchic']
... ... ... ...
8605 7324673 1.0 ['emergence', 'outgrowth', 'growth']
If you have columns that need to be kept from exploding, I suggest setting them as the index first and then exploding.
For your example, see if this works for you:
df = df.set_index(['synset','rating']).apply(pd.Series.explode) # this would work for exploding multiple columns as well
# then reset the index
df = df.reset_index()
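A likely reason the original explode printed the lists unchanged is the CSV round-trip: to_csv writes each list as its string representation, and explode leaves non-list values as they are. A sketch of parsing the strings back into lists first (assuming the column really holds list literals like "['a', 'b']"):

```python
import ast
import pandas as pd

# stand-in for the frame read back from jsonToCSV.csv: 'words' holds strings, not lists
df = pd.DataFrame({
    'synset': [1034312, 146856],
    'rating': [0.0, 0.0],
    'words': ["['discourse', 'talk about', 'discuss']", "['merging', 'meeting']"],
})

# turn the stringified lists back into real Python lists, then explode
df['words'] = df['words'].apply(ast.literal_eval)
df = df.explode('words', ignore_index=True)
print(df)
```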
I got the following csv file with sample data:
Now I want to replace the values of the 'SIFT' and 'PolyPhen' columns with the data inside their parentheses. So for row 1 the SIFT value will be replaced by 0.82, and for row 2 the SIFT value will be 0.85. I also want the part before the parentheses, tolerated/deleterious, in a new column named 'SIFT_prediction'.
This is what I tried so far:
import pandas as pd
import re
testfile = 'test_sift_columns.csv'
df = pd.read_csv(testfile)
df['SIFT'].re.search(r'\((.*?)\)',s).group(1)
This code is meant to take everything inside the parentheses of the SIFT column, but it does not replace anything. I probably need a for loop to read and replace every row, but I don't know how to do it correctly. I am also not sure whether a regular expression is necessary with pandas; maybe there is a smarter way to solve my problem.
Use Series.str.extract:
df = pd.DataFrame({'SIFT': ['tol(0.82)', 'tol(0.85)', 'tol(1.42)'],
                   'PolyPhen': ['beg(0)', 'beg(0)', 'beg(0)']})
pat = r'(.*?)\((.*?)\)'
df[['SIFT_prediction','SIFT']] = df['SIFT'].str.extract(pat)
df[['PolyPhen_prediction','PolyPhen']] = df['PolyPhen'].str.extract(pat)
print(df)
SIFT_prediction SIFT PolyPhen_prediction PolyPhen
0 tol 0.82 beg 0
1 tol 0.85 beg 0
2 tol 1.42 beg 0
Alternative:
df[['SIFT_prediction','SIFT']] = df['SIFT'].str.rstrip(')').str.split('(', expand=True)
df[['PolyPhen_prediction','PolyPhen']] = df['PolyPhen'].str.rstrip(')').str.split('(', expand=True)
You can do something like replacing all non-numeric characters with empty strings to get the float value, and replacing all non-letter characters to get the prediction.
import pandas as pd
df = pd.DataFrame({'ID': [1,2,3,4], 'SIFT': ['tolerated(0.82)', 'tolerated(0.85)', 'tolerated(0.25)', 'dedicated(0.5)']})
df['SIFT_formatted'] = df.SIFT.str.replace('[^0-9.]', '', regex=True).astype(float)
df['SIFT_prediction'] = df.SIFT.str.replace('[^a-zA-Z]', '', regex=True)
df
Would give you:
ID SIFT SIFT_formatted SIFT_prediction
0 1 tolerated(0.82) 0.82 tolerated
1 2 tolerated(0.85) 0.85 tolerated
2 3 tolerated(0.25) 0.25 tolerated
3 4 dedicated(0.5) 0.50 dedicated
Let's say I have a list of names like this one in a csv:
Nom;Link;NonLink
Deb;John;
John;Deb;
Martha;Travis;
Travis;Martha;
Allan;;
Lois;;
Jayne;;
Brad;;Abby
Abby;;Brad
I imported it using numpy:
import numpy as np
file = np.genfromtxt('liste.csv', dtype=str, delimiter=';', skip_header=1)
Now, I'm isolating my first column:
Nom = np.array(file[:,0])
I would like to create a matrix using only this first column to get a result like this one:
Deb John Martha etc...
Deb 0 0 0 ...
John 0 0 0 ...
Martha 0 0 0 ...
etc...
Is there a numpy function for that?
Edit: My end goal is to make a little program to assign seats at tables, where people in Link must be seated at the same table and people in NonLink must not be at the same table.
Thank you,
You can use pandas and create a dataframe using Nom variable.
Something like this:
import pandas as pd
df = pd.DataFrame([[0] * len(Nom)] * len(Nom), index=Nom, columns=Nom)
print(df)
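Given the stated end goal, a hedged sketch of how that matrix could then be filled in: mark each Link pair with 1 and each NonLink pair with -1 via df.loc, using the names as labels (the pairs below are a small made-up subset of the CSV):

```python
import pandas as pd

Nom = ['Deb', 'John', 'Brad', 'Abby']
df = pd.DataFrame([[0] * len(Nom)] * len(Nom), index=Nom, columns=Nom)

# 1 = must share a table, -1 = must not; set both directions
links = [('Deb', 'John')]
nonlinks = [('Brad', 'Abby')]
for a, b in links:
    df.loc[a, b] = df.loc[b, a] = 1
for a, b in nonlinks:
    df.loc[a, b] = df.loc[b, a] = -1
print(df)
```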