I am using Pandas with PsychoPy to reorder my results in a dataframe. The problem is that the dataframe varies according to the participant's performance. However, I would like to have a common dataframe, where non-existing columns are created as empty. The columns then have to be in a specific order in the output file.
Let's suppose I have a dataframe from a participant with the following columns:
x = ["Error_1", "Error_2", "Error_3"]
I want the final dataframe to look like this:
x = x[["Error_1", "Error_2", "Error_3", "Error_4"]]
Where "Error_4" is created as an empty column.
I tried applying something like this (adapted from another question):
if "Error_4" not in x:
x["Error_4"] = ""
x = x[["Error_1", "Error_2", "Error_3", "Error_4"]]
In principle it should work; however, I have roughly 70 other columns for which I would have to do this, and repeating it for each of them doesn't seem practical.
Do you have any suggestions?
I also tried creating a new dataframe with all the possible columns, e.g.:
y = ["Error_1", "Error_2", "Error_3", "Error_4"]
However, it is still not clear to me how to merge the dataframes x and y while skipping columns with the same header.
Use DataFrame.reindex:
x = x.reindex(["Error_1", "Error_2", "Error_3", "Error_4"], axis=1, fill_value='')
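For instance, a minimal sketch of what reindex does here, assuming x holds the three error columns from the question:

import pandas as pd

# Hypothetical participant data with only three of the expected columns
x = pd.DataFrame({"Error_1": [0], "Error_2": [1], "Error_3": [2]})

x = x.reindex(["Error_1", "Error_2", "Error_3", "Error_4"], axis=1, fill_value='')
print(x.columns.tolist())  # ['Error_1', 'Error_2', 'Error_3', 'Error_4']
# Error_4 now exists and is filled with '' in every row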
Thanks for the reply, I followed your suggestion and adapted it. I post it here, since it may be useful for someone else.
First I create a dataframe y as I want my output to look like:
y = ["Error_1", "Error_2", "Error_3", "Error_4", "Error_5", "Error_6"]
Then, I get my actual output file df and modify it as df2, adding to it all the columns of y in the exact same order.
import pandas as pd

df = pd.DataFrame(myData)
columns = df.columns.values.tolist()  # not strictly needed for the reindex below
df2 = df.reindex(columns=y, fill_value='')
In this case, all the columns that are absent in df but present in y are going to be added to df2.
However, let's suppose that in df there is a column "Error_7" that is absent from y; reindex drops it from df2. To keep track of these columns I am just applying merge and creating a new dataframe df3:
df3 = pd.merge(df2,df)
df3.to_csv(filename+'UPDATED.csv')
The missing columns are going to be added at the end of the dataframe.
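To illustrate, here is a small sketch of that round trip with made-up data; note that pd.merge joins on all shared columns, so those columns need to uniquely identify each row:

import pandas as pd

df = pd.DataFrame({"Error_1": [1], "Error_7": [9]})  # raw output with an unexpected column
y = ["Error_1", "Error_2"]                           # desired layout
df2 = df.reindex(columns=y, fill_value='')           # Error_7 dropped, Error_2 added as ''
df3 = pd.merge(df2, df)                              # joins on Error_1, bringing Error_7 back
print(df3.columns.tolist())                          # ['Error_1', 'Error_2', 'Error_7']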
If you think this procedure might have drawbacks, or if there is another way to do it, let me know :)
Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1:
type_of_fruit  name_of_fruit  price
.....          .....          .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then try to drop to another df any columns not in expected_cols, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But then this seems problematic depending on column ordering, and also in cases where there could be either more columns than expected or fewer columns than expected. In cases where there are fewer columns than expected (i.e. df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and then on how to handle the case where there are more columns than expected.
You can use set difference using -:
Assuming df1 has these columns:

df1_cols = df1.columns  # ['type_of_fruit', 'name_of_fruit', 'price']
expected_cols = ['name_of_fruit', 'price']
unwanted_cols = list(set(df1_cols) - set(expected_cols))
df2 = df1[unwanted_cols]
df1.drop(columns=unwanted_cols, inplace=True)
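One caveat: set() does not preserve order, so if the order of the saved columns matters, a list comprehension over df1.columns keeps them in their original order, e.g.:

# Order-preserving alternative to the set difference above
unwanted_cols = [c for c in df1.columns if c not in expected_cols]
df2 = df1[unwanted_cols]        # the "dropped" columns, original order kept
df1 = df1[expected_cols]        # only the expected columns, in the expected order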
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and you can store the results in a dict where the True key gets the DataFrame with the subset of columns in the list and the False key has the subset of columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data=[[1, 2, 3]],
                  columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure the expected columns are always there, then do
#d[True] = d[True].reindex(columns=expected_cols)
d[True]
#   name_of_fruit  price
#0              2      3
d[False]
#   type_of_fruit
#0              1
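As an aside, groupby with axis=1 is deprecated in recent pandas versions, so an equivalent split using a boolean mask over the columns may age better; a sketch of the same idea:

# Same split without the deprecated axis=1 groupby
mask = df.columns.isin(expected_cols)
d = {True: df.loc[:, mask], False: df.loc[:, ~mask]}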
I'm a little new to Python. I am trying to merge two dataframes with similar columns; the second dataframe has one extra column, which I need to bring into the new dataframe.
Detailed view of dataframes
Code used:

df3 = pd.merge(df, df1[['Id', 'Value_data']], on='Id')
df3 = pd.merge(df, df1[['Id', 'Value_data']], on='Id', how='outer')
I am getting the output CSV as:
Unnamed: 0 Id_x Number_x Class_x Section_x Place_x Name_x Executed_Date_x Version_x Value PartDateTime_x Cycles_x Id_y Mumber_y Class_y Section_y Place_y Name_y Executed_Date_y Version_y Value_data PartDateTime_y Cycles_y
whereas I don't want the _x and _y suffixes; I wanted the output to be:
Id Number Class Section Place Name Executed_Date Version Value Value_data PartDateTime Cycles
If I use df2 = pd.concat([df, df1], axis=0, ignore_index=True),
then I get values in the format mentioned below in all columns except Value_data, which ends up as an empty column.
Id Number Class Section Place Name Executed_Date Version Value Value_data PartDateTime Cycles
Please help me with a solution for this. Thanks for your time.
I think the easiest path is to make a temporary df, let's call it df_temp2, which is a copy of df_2 with the column renamed, and then append it to df_1:
df_temp2 = df_2.copy()
df_temp2.columns = ['..','..', .... 'value' ...]
then
df_total = df_1.append(df_temp2)
This gives you a total DataFrame with all the rows of df_1 and df_2. The append() method supports a few arguments; check the docs for more details.
--- Added ---
One other possible approach is to use the pd.concat() function, which can work in the same way as the .append() method, like this:
result = pd.concat([df_1, df_temp2])
In your case the two approaches would lead to similar performance. You can consider append() a method written on top of pd.concat() but applied to a DataFrame itself. (Note that DataFrame.append was deprecated and removed in pandas 2.0, so pd.concat is the more future-proof choice.)
Full docs about concat() are here: pd.concat() docs
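Putting the two steps together, a minimal sketch, assuming the differing column is called Value_data and should line up with Value (column names taken from the question):

import pandas as pd

df_temp2 = df_2.rename(columns={'Value_data': 'Value'})    # align the odd column name
df_total = pd.concat([df_1, df_temp2], ignore_index=True)  # stack the rows of both frames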
Hope this was helpful.
import pandas as pd

df = pd.read_csv('C:/Users/output_2.csv')
df1 = pd.read_csv('C:/Users/output_1.csv')
df1_temp = df1[['Id', 'Cycles', 'Value_data']].copy()
df3 = pd.merge(df, df1_temp, on=['Id', 'Cycles'], how='inner')
df3 = df3.drop(columns="Unnamed: 0")
df3.to_csv('C:/Users/output.csv')
This worked
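As a side note, the "Unnamed: 0" column appears because the index was written into the CSV; the drop step can be avoided by not writing the index in the first place, or by reading the first column back as the index:

df3.to_csv('C:/Users/output.csv', index=False)             # don't write the index
# or, when reading:
# df = pd.read_csv('C:/Users/output_2.csv', index_col=0)   # treat column 0 as the index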
I have two dataframes. One has two important labels that have some associated columns for each label. The second one has the same labels and more useful data for those same labels. I'm trying to replace the values in the first with the values of the second for each appropriate label. For example:
df = pd.DataFrame({'a': ['x', 'y', 'z', 't'], 'b': ['t', 'x', 'y', 'z'], 'a_1': [1, 2, 3, 4], 'a_2': [4, 2, 4, 1], 'b_1': [1, 2, 3, 4], 'b_2': [4, 2, 4, 2]})
df_2 = pd.DataFrame({'n': ['x', 'y', 'z', 't'], 'n_1': [1, 2, 3, 4], 'n_2': [1, 2, 3, 4]})
I want to replace the values for n_1 and n_2 in a_1 and a_2 for the values of a and b that are the same as n. So far I tried using the replace and map functions, and they work when I use them like this:
df.iloc[0] = df.iloc[0].replace({'a_1': df['a_1']}, df_2['n_1'].loc[df['a'].iloc[0]])
I can make the substitution for one specific line, but if I try to put that in a for loop and change the numbers I get the error Cannot index by location index with a non-integer key. If I take the ilocs out of there I get the original df unchanged and without any error messages. I get the same behavior when I use the map function. The way I tried to do the for loop and the map:
for i in df:
    df.iloc[i] = df.iloc[i].replace({'a_1': df['a_1']}, df_2['n_1'].loc[df['a'].iloc[i]])
    df.iloc[i] = df.iloc[i].replace({'b_1': df['b_1']}, df_2['n_1'].loc[df['b'].iloc[i]])
And so on. And for the map function:
for i in df:
    df = df.map({df['b_1']: df_2['n_1'].loc[df['b'].iloc[i]]})
    df = df.map({df['a_1']: df_2['n_1'].loc[df['a'].iloc[i]]})
I would like the resulting dataframe to have the same format as the first but with the values of the second, something like this:
df = pd.DataFrame({'a': ['x', 'y', 'z', 't'], 'b': ['t', 'x', 'y', 'z'], 'an_1': [1, 2, 3, 4], 'an_2': [1, 2, 3, 4], 'bn_1': [1, 2, 3, 4], 'bn_2': [1, 2, 3, 4]})
where an and bn are the values for a and b when n is equal to a or b in the second dataframe.
Hope this is comprehensible.
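For what it's worth, one possible way to express the transformation described above is to build lookup Series from df_2 and map each label column through them; this keys each row on its label, which is one reading of the desired output:

# Lookup Series indexed by the labels in 'n'
lookup_1 = df_2.set_index('n')['n_1']
lookup_2 = df_2.set_index('n')['n_2']
for col in ['a', 'b']:
    df[col + '_1'] = df[col].map(lookup_1)  # overwrite a_1/b_1 with n_1, keyed on the label
    df[col + '_2'] = df[col].map(lookup_2)  # overwrite a_2/b_2 with n_2, keyed on the label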
I'm taking an online course and I'm stuck at a part where I'm trying to plot based on column name. It's not part of the course to do this, so there's not much guidance.
#column year rename
gdp_df.rename(columns=lambda x: x + "_GDP", inplace=True)
selfemployed_df.rename(columns=lambda x: x + "_SE", inplace=True)
salaried_df.rename(columns=lambda x: x + "_S", inplace=True)
#country column rename
gdp_df.rename(columns={"GDP per capita_GDP":"country"},inplace=True)
selfemployed_df.rename(columns={"Total self-employed (%)_SE":"country"}, inplace=True)
salaried_df.rename(columns={"Total salaried employees (%)_S":"country"}, inplace=True)
#merge the dataframe
merge_df = salaried_df.merge(
    gdp_df.merge(selfemployed_df, left_on='country', right_on='country', how='inner'),
    left_on='country', right_on='country', how='inner')
Now I'm stuck at the step where I plot by _S, _GDP, or _SE. How would I plot grouped by whether the column name contains _S, _GDP, or _SE? Am I going about this all wrong?
Well, "wrong"... you can always do things in many ways.
However, if you want to merge all your datasets into one and are asking how to extract a subset of columns depending on those suffixes, you could do the following:
S_cols = merge_df.columns[merge_df.columns.str.endswith('_S')]
Then
merge_df[S_cols]
would give you a dataframe with only the columns ending with '_S' inside.
One way to plot would then simply be
merge_df[S_cols].plot()
You can imagine the same with the other endings.
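As an aside, DataFrame.filter can do the same selection in one step:

merge_df.filter(regex='_S$').plot()     # only columns ending in _S
merge_df.filter(regex='_GDP$').plot()   # only columns ending in _GDP
merge_df.filter(regex='_SE$').plot()    # only columns ending in _SE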
What I do not understand: since you merge the dataframes yourself, you already have the separate dataframes at hand, i.e. gdp_df, selfemployed_df and salaried_df. So what's the problem here?
So essentially I have a data frame with a bunch of columns, some of which I want to keep (stored in to_keep) and some other columns that I want to create categorical variables for using pandas.get_dummies (these are stored in to_change).
However, I can't seem to get the syntax for this down, and all the examples I have seen (e.g. here: http://blog.yhat.com/posts/logistic-regression-and-python.html) don't seem to help.
Here's what I have at present:
new_df = df.copy()
dummies= pd.get_dummies(new_df[to_change])
new_df = new_df[to_keep].join(dummies)
return new_df
Any help on where I am going wrong would be appreciated, as the problem I keep running into is that this only adds categorical variables for the first column in to_change.
Didn't understand the problem completely, I must say.
However, say your DataFrame is df, and you have a list of columns to_make_categorical.
The DataFrame with the non-categorical columns is
wo_categoricals = df[[c for c in list(df.columns) if c not in to_make_categorical]]
The DataFrames of the categorical expansions are
categoricals = [pd.get_dummies(df[c], prefix=c) for c in to_make_categorical]
Now you could just concat them horizontally:
pd.concat([wo_categoricals] + categoricals, axis=1)
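A small self-contained sketch of the whole flow, with made-up column names:

import pandas as pd

df = pd.DataFrame({'keep_me': [1, 2], 'color': ['red', 'blue'], 'size': ['S', 'L']})
to_make_categorical = ['color', 'size']

wo_categoricals = df[[c for c in df.columns if c not in to_make_categorical]]
categoricals = [pd.get_dummies(df[c], prefix=c) for c in to_make_categorical]
new_df = pd.concat([wo_categoricals] + categoricals, axis=1)
# new_df columns: keep_me, color_blue, color_red, size_L, size_S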