I'm taking an online course and I'm stuck at a part where I'm trying to plot based on column name. Doing this isn't part of the course, so there's not much guidance.
# add a suffix to every column name so the three frames stay distinct after the merge
gdp_df.rename(columns=lambda x: x + "_GDP", inplace=True)
selfemployed_df.rename(columns=lambda x: x + "_SE", inplace=True)
salaried_df.rename(columns=lambda x: x + "_S", inplace=True)
# rename the (now suffixed) country column back to 'country' in each frame
gdp_df.rename(columns={"GDP per capita_GDP":"country"},inplace=True)
selfemployed_df.rename(columns={"Total self-employed (%)_SE":"country"}, inplace=True)
salaried_df.rename(columns={"Total salaried employees (%)_S":"country"}, inplace=True)
# merge the three dataframes on 'country'
merge_df = salaried_df.merge(gdp_df.merge(selfemployed_df, on='country', how='inner'), on='country', how='inner')
Now I'm stuck at the step where I plot by _S, _GDP or _SE. How would I plot, grouped by whether the column name ends with _S, _GDP or _SE? Am I going about this all wrong?
Well, wrong... you can always do things in many ways.
However, if you want to merge all your datasets into one and ask yourself how to extract a subset of columns depending on those suffixes, you could do the following:
S_cols = merge_df.columns[merge_df.columns.str.endswith('_S')]
Then
merge_df[S_cols]
would give you a dataframe with only the columns ending with '_S' inside.
One way to plot would then simply be
merge_df[S_cols].plot()
You can imagine the same with the other endings.
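For example, here is a minimal sketch that loops over the three suffixes (assuming matplotlib is installed and merge_df is the merged frame built above):
import matplotlib.pyplot as plt

# one figure per suffix group
for suffix in ['_S', '_GDP', '_SE']:
    cols = merge_df.columns[merge_df.columns.str.endswith(suffix)]
    merge_df[cols].plot(title='Columns ending with ' + suffix)
plt.show()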
What I do not understand: since you merge the dataframes yourself, you already have the separate dataframes at hand, i.e. gdp_df, selfemployed_df and salaried_df. So what's the problem here?
I want to join DataFrames with the same column names into a single dataframe. When I run df = pd.concat([df1, df2], ignore_index=True, sort=False), I get an extra column. Please help.
[screenshot: the concatenated DataFrame, showing the extra column]
I think the problem is that the State column has a trailing space in one DataFrame (i.e. 'State ' instead of 'State'), so after concat it was not aligned with the other frame's State column.
Solution is:
# check the column names for differences
print (df1.columns)
print (df2.columns)
#remove trailing spaces
df1.columns = df1.columns.str.strip()
df2.columns = df2.columns.str.strip()
df = pd.concat([df1, df2], ignore_index=True, sort=False)
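If you want to see the effect in isolation, here is a self-contained toy example (made-up data and column names) showing how a single trailing space creates the extra column and how strip fixes it:
import pandas as pd

# note the trailing space in 'State ' in the second frame
df1 = pd.DataFrame({'State': ['NY'], 'Pop': [19]})
df2 = pd.DataFrame({'State ': ['CA'], 'Pop': [39]})

print(pd.concat([df1, df2], ignore_index=True, sort=False))  # extra 'State ' column appears

df2.columns = df2.columns.str.strip()
print(pd.concat([df1, df2], ignore_index=True, sort=False))  # one 'State' column, as expected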
It looks like it's not recognizing the "state" as the same field. I'd probably start by taking a look at the setup of the state field in each table to see why it thinks they're different. If you can find a difference, format them to be the same and then try again.
I'm trying to create a DataFrame out of two existing ones. I read the titles of some articles on the web; the first column is the title and the ones after it are timestamps.
I want to concat both DataFrames but leave out the rows with the same title (column one).
I tried
df = pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
but because the other columns may not be exactly the same every time, I need to leave out every row that has the same first column. How would I do this?
Btw, sorry for not knowing all the right terms for my problem.
You should first remove the duplicate rows from df2 and then concat it with df1:
df = pd.concat([df1, df2[~df2.title.isin(df1.title)]]).reset_index(drop=True)
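A quick toy run (made-up titles) to illustrate what the filter does:
import pandas as pd

df1 = pd.DataFrame({'title': ['a', 'b'], 'ts1': [1, 2]})
df2 = pd.DataFrame({'title': ['b', 'c'], 'ts1': [9, 3]})

df = pd.concat([df1, df2[~df2.title.isin(df1.title)]]).reset_index(drop=True)
print(df)  # keeps a and b from df1 and c from df2; df2's duplicate 'b' is dropped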
This probably solves your problem:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df2=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah1','blah2','blah3','blah4','blah']
df2.columns=['blah5','blah6','blah7','blah8','blah']
# drop every column in df2 whose name also appears in df
for col in list(df2.columns):
    if col in df.columns:
        df2 = df2.drop(col, axis=1)

print(pd.concat([df, df2], axis=1))
I am using Pandas with PsychoPy to reorder my results in a dataframe. The problem is that the dataframe is going to vary according to the participant performance. However, I would like to have a common dataframe, where non-existing columns are created as empty. Then the columns have to be in a specific order in the output file.
Let's suppose I have a dataframe from a participant with the following columns:
x = ["Error_1", "Error_2", "Error_3"]
I want the final dataframe to look like this:
x = x[["Error_1", "Error_2", "Error_3", "Error_4"]]
Where "Error_4" is created as an empty column.
I tried applying something like this (adapted from another question):
if "Error_4" not in x:
x["Error_4"] = ""
x = x[["Error_1", "Error_2", "Error_3", "Error_4"]]
In principle it should work; however, I have about 70 other columns for which I would have to do this, and it doesn't seem practical to repeat it for each of them.
Do you have any suggestions?
I also tried creating a new dataframe with all the possible columns, e.g.:
y = ["Error_1", "Error_2", "Error_3", "Error_4"]
However, it is still not clear to me how to merge the dataframes x and y skipping columns with the same header.
Use DataFrame.reindex:
x = x.reindex(["Error_1", "Error_2", "Error_3", "Error_4"], axis=1, fill_value='')
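A quick check with toy data (assuming x is an actual DataFrame, not just the list of names):
import pandas as pd

x = pd.DataFrame({'Error_1': [0], 'Error_2': [1], 'Error_3': [0]})
x = x.reindex(["Error_1", "Error_2", "Error_3", "Error_4"], axis=1, fill_value='')
print(x)  # Error_4 is added as an empty column; existing columns keep their data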
Thanks for the reply, I followed your suggestion and adapted it. I post it here, since it may be useful for someone else.
First I create a list y with the columns I want my output to have, in order:
y = ["Error_1", "Error_2", "Error_3", "Error_4", "Error_5", "Error_6"]
Then, I get my actual output file df and modify it as df2, adding to it all the columns of y in the exact same order.
df = pd.DataFrame(myData)
df2 = df.reindex(columns=y, fill_value='')
This way, all the columns that are present in y but absent from df are added to df2 (as empty columns).
However, suppose that df has a column "Error_7" that is absent from y (and is therefore dropped by the reindex). To keep track of these columns I just apply merge and create a new dataframe df3:
df3 = pd.merge(df2, df)
df3.to_csv(filename+'UPDATED.csv')
The missing columns are going to be added at the end of the dataframe.
If you think this procedure might have drawbacks, or if there is another way to do it, let me know :)
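For what it's worth, one alternative I considered (just a sketch, not tested against my data) is to reindex on the union of y and whatever extra columns df actually has, which avoids the second merge:
# columns in df that are not in the template y, appended after it
extra = [c for c in df.columns if c not in y]
df2 = df.reindex(columns=y + extra, fill_value='')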
I apologize in advance if this has been covered; I could not find anything quite like this. This is my first programming job (I was previously software QA) and I've been beating my head against a wall on this.
I have 2 dataframes; one is very large [df2] (14.6 million lines) and I am iterating through it in chunks. I attempted to compare a column of the same name in each dataframe; where they're equal, I would like to output a secondary column of the larger frame.
i.e.
if df1['tag'] == df2['tag']:
    df1['new column'] = df2['plate']
I attempted a merge but this didn't output what I expected.
df3 = pd.merge(df1, df2, on='tag', how='left')
I hope I did an okay job explaining this.
[Edit:] I also believe I should mention that df2 and df1 both have many additional columns I do not want to interact with/change. Is it possible to only compare the single columns of two dataframes, and output the third additional column?
You may try an inner merge. First, inner merge df1 with df2; then you will get plates only for the common rows, and you can rename the new df1 column as you need:
df1 = df1.merge(df2, on="tag", how = 'inner')
df1['new column'] = df1['plate']
del df1['plate']
I hope this works.
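A toy run (made-up tags and plates) of that idea:
import pandas as pd

df1 = pd.DataFrame({'tag': ['a', 'b', 'c'], 'other': [1, 2, 3]})
df2 = pd.DataFrame({'tag': ['b', 'c', 'd'], 'plate': ['P2', 'P3', 'P4']})

df1 = df1.merge(df2, on='tag', how='inner')
df1['new column'] = df1['plate']
del df1['plate']
print(df1)  # only tags b and c survive, each carrying its plate as 'new column'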
As smci mentioned, this is a perfect time to use join/merge. If you're looking to preserve df1, a left join is what you want. So you were on the right path:
df1 = pd.merge(df1,
               df2[['tag', 'plate']],
               on='tag', how='left')
df1 = df1.rename({'plate': 'new column'}, axis='columns')
That will only compare the tag columns in each dataframe, so the other columns won't matter. It'll bring over the plate column from df2 and then rename it to whatever you want your new column to be named.
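For example (toy frames, hypothetical values), the left join keeps every df1 row and fills NaN where df2 has no matching tag:
import pandas as pd

df1 = pd.DataFrame({'tag': ['a', 'b'], 'keep_me': [10, 20]})
df2 = pd.DataFrame({'tag': ['b', 'c'], 'plate': ['P2', 'P3']})

out = pd.merge(df1, df2[['tag', 'plate']], on='tag', how='left')
out = out.rename({'plate': 'new column'}, axis='columns')
print(out)  # 'a' gets NaN in 'new column', 'b' gets P2; df1's other columns are untouched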
This is totally a case for join/merge. You want to put df2 on the left because it's smaller.
df2.join(df1.set_index('tag'), on='tag', ...)
You only misunderstood the type of join/merge you want to make:
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’
'how'='left' join would create an (unwanted) entry for all rows of the LHS df2. That's not quite what you want (if df2 contained other tag values not seen in df1, you'd also get entries for them).
'how'='inner' would form the intersection of df2 and df1 on the 'on'='tag' field. i.e. you only get entries for where df1 contains a valid tag value according to df2.
So:
df3 = df2.join(df1.set_index('tag'), on='tag', how='inner')
# then reference df3['plate']
or if you only want the 'plate' column in df3 (or some other selection of columns), you can directly do:
df2.join(df1.set_index('tag'), on='tag', how='inner')['plate']
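Here is a self-contained sketch (toy data; df1 is indexed by tag via set_index so join can align on it):
import pandas as pd

df1 = pd.DataFrame({'tag': ['a', 'b'], 'extra': [1, 2]})
df2 = pd.DataFrame({'tag': ['b', 'c'], 'plate': ['P2', 'P3']})

df3 = df2.join(df1.set_index('tag'), on='tag', how='inner')
print(df3['plate'])  # only the row with tag 'b' survives the intersection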
So essentially I have a data frame with a bunch of columns, some of which I want to keep (stored in to_keep) and some other columns that I want to create categorical variables for using pandas.get_dummies (these are stored in to_change).
However, I can't seem to get the syntax for this down, and all the examples I have seen (i.e. here: http://blog.yhat.com/posts/logistic-regression-and-python.html) don't seem to help.
Here's what I have at present:
new_df = df.copy()
dummies= pd.get_dummies(new_df[to_change])
new_df = new_df[to_keep].join(dummies)
return new_df
Any help on where I am going wrong would be appreciated, as the problem I keep running into is that this only adds categorical variables for the first column in to_change.
Didn't understand the problem completely, I must say.
However, say your DataFrame is df, and you have a list of columns to_make_categorical.
The DataFrame with the non-categorical columns is:
wo_categoricals = df[[c for c in list(df.columns) if c not in to_make_categorical]]
The DataFrames of the categorical expansions are
categoricals = [pd.get_dummies(df[c], prefix=c) for c in to_make_categorical]
Now you could just concat them horizontally:
pd.concat([wo_categoricals] + categoricals, axis=1)
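A compact toy run (hypothetical column names) to show the shape of the result:
import pandas as pd

df = pd.DataFrame({'age': [25, 32], 'color': ['red', 'blue'], 'shape': ['box', 'tube']})
to_make_categorical = ['color', 'shape']

wo_categoricals = df[[c for c in list(df.columns) if c not in to_make_categorical]]
categoricals = [pd.get_dummies(df[c], prefix=c) for c in to_make_categorical]

print(pd.concat([wo_categoricals] + categoricals, axis=1))
# columns: age, color_blue, color_red, shape_box, shape_tube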