pandas.concat of multiple data frames using only common columns - python

I have multiple pandas data frame objects cost1, cost2, cost3 ...
They have different column names (and different numbers of columns) but share some columns in common.
The number of columns in each data frame is fairly large, so handpicking the common columns manually would be painful.
How can I append the rows from all of these data frames into one single data frame, while retaining only the common columns?
As of now I have
frames = [cost1, cost2, cost3]
new_combined = pd.concat(frames, ignore_index=True)
This obviously contains columns which are not common across all the data frames.

For future readers: this functionality is built into pandas itself.
pd.concat keeps only the common columns if you pass the join='inner' argument, e.g.
pd.concat(frames, join='inner', ignore_index=True)
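A minimal sketch of the behavior, using three hypothetical frames with partially overlapping columns:
import pandas as pd

# Hypothetical frames; only 'item' and 'price' appear in all three.
cost1 = pd.DataFrame({'item': ['a', 'b'], 'price': [1, 2], 'tax': [0.1, 0.2]})
cost2 = pd.DataFrame({'item': ['c'], 'price': [3], 'shipping': [5.0]})
cost3 = pd.DataFrame({'item': ['d'], 'price': [4], 'discount': [0.5]})

frames = [cost1, cost2, cost3]

# join='inner' drops every column that is not present in all frames.
combined = pd.concat(frames, join='inner', ignore_index=True)
print(combined.columns.tolist())  # ['item', 'price']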

You can find the common columns with Python's set.intersection:
common_cols = list(set.intersection(*(set(df.columns) for df in frames)))
(Note that sets are unordered, so the resulting column order is arbitrary.)
To concatenate using only the common columns, you can use
pd.concat([df[common_cols] for df in frames], ignore_index=True)
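If column order matters, a small variation keeps the first frame's ordering (a sketch, assuming frames is defined as in the question):
# Preserve the column order of the first frame.
common_cols = [c for c in frames[0].columns
               if all(c in df.columns for df in frames[1:])]
combined = pd.concat([df[common_cols] for df in frames], ignore_index=True)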

Related

Select 2 different set of columns from column multiindex dataframe

I have the following column multiindex dataframe.
I would like to select (or get a subset of) the dataframe with different columns for each level_0 index (i.e. x_mm and y_mm from virtual, and z_mm, rx_deg, ry_deg, rz_deg from actual). From what I have read I think I might be able to use pandas IndexSlice, but I'm not entirely sure how to use it in this context.
So far my workaround is to use pd.concat, selecting the two sets of columns independently. I have the feeling that this can be done neatly with slicing.
You can programmatically generate the tuples to slice your MultiIndex:
from itertools import product

# Each entry pairs the level-0 label(s) with the level-1 columns wanted under them.
cols = ((('virtual',), ('x_mm', 'y_mm')),
        (('actual',), ('z_mm', 'rx_deg', 'ry_deg', 'rz_deg')))

# Expand each pair into full (level_0, level_1) column tuples and select them.
out = df[[t for x in cols for t in product(*x)]]
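A self-contained sketch of the same idea, with a small hypothetical frame built just to make it runnable:
import pandas as pd
from itertools import product

# Hypothetical MultiIndex columns mirroring the question's structure.
columns = pd.MultiIndex.from_product(
    [['virtual', 'actual'],
     ['x_mm', 'y_mm', 'z_mm', 'rx_deg', 'ry_deg', 'rz_deg']])
df = pd.DataFrame([list(range(12)), list(range(12, 24))], columns=columns)

cols = ((('virtual',), ('x_mm', 'y_mm')),
        (('actual',), ('z_mm', 'rx_deg', 'ry_deg', 'rz_deg')))
out = df[[t for x in cols for t in product(*x)]]
print(out.columns.tolist())
# [('virtual', 'x_mm'), ('virtual', 'y_mm'), ('actual', 'z_mm'), ...]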

Pandas, when merging two dataframes and values for some columns don't carry over

I'm trying to combine two dataframes in pandas using a left merge on common columns, only when I do that the merged data doesn't carry over and instead gives NaN values. All of the columns are objects and match that way, so I'm not quite sure what's going on.
This is my first dataframe header, which is the output from a program.
This is my second dataframe header. The second df is a 'key' document to match the first output with its correct id/tastant/etc., and they share the same date/subject/procedure/etc.
And this is my code that's trying to merge them on the common columns:
combined = first.merge(second, on=['trial', 'experiment', 'subject', 'date', 'procedure'], how='left')
The output's id, ts, and tastant columns should match the first dataframe, but they don't.
Check your dtypes and make sure they match between the two dataframes. Pandas makes assumptions about data types when it imports; it could be treating numbers as int in one dataframe and object in the other.
For the string columns, check for extra whitespace. It can creep into datasets, and since you can't see it but pandas can, it results in no match. You can use df['column'].str.strip().
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html
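A sketch of the kind of cleanup that usually fixes this, assuming the key columns are named as in the question:
# Compare the dtypes of the merge keys side by side.
keys = ['trial', 'experiment', 'subject', 'date', 'procedure']
print(first[keys].dtypes)
print(second[keys].dtypes)

# Normalize both sides: cast the keys to string and strip stray whitespace.
for df in (first, second):
    for col in keys:
        df[col] = df[col].astype(str).str.strip()

combined = first.merge(second, on=keys, how='left')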

Pandas Dataframes: Combining Columns from Two Global Datasets when the rows hold different Countries

My problem is that these two CSV files have different countries at different rows, so I can't just append the column in question to the other data frame.
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
I'm trying to think of some way to use a for loop, checking every row, and add the recovered cases to the correct row where the country name is the same in both data frames, but I don't know how to put that idea into code. Help?
You can do this a couple of ways:
Option 1: use pd.concat with set_index
pd.concat([df_confirmed.set_index(['Province/State', 'Country/Region']),
           df_recovered.set_index(['Province/State', 'Country/Region'])],
          axis=1, keys=['Confirmed', 'Recovered'])
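With keys=['Confirmed', 'Recovered'], the result carries a column MultiIndex whose top level tells you which source each date column came from.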
Option 2: use pd.DataFrame.merge with a left join or an outer join via the how parameter
df_confirmed.merge(df_recovered, on=['Province/State', 'Country/Region'], how='left',
                   suffixes=('_confirmed', '_recovered'))
Loading the data with pd.read_csv from the GitHub raw URLs:
df_recovered = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
df_confirmed = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')

Can I get concat() to ignore column names and work only based on the position of the columns?

The docs, at least as of version 0.24.2, specify that pandas.concat can ignore the index with ignore_index=True, but:
Note the index values on the other axes are still respected in the join.
Is there a way to avoid this, i.e. to concatenate based on position only, ignoring the names of the columns?
I see two options:
rename the columns so they match, or
convert to numpy, concatenate in numpy, then go from numpy back to pandas.
Are there more elegant ways?
For example, if I want to add the series s as an additional row to the dataframe df, I can:
convert s to a frame,
transpose it,
rename its columns so they are the same as those of df, and
concatenate.
It works, but it seems very "un-pythonic"!
A toy example is below; this example is with a dataframe and a series, but the same concept applies with two dataframes.
import pandas as pd

df = pd.DataFrame()
df['a'] = [1]
df['x'] = 'this'
df['y'] = 'that'

s = pd.Series([3, 'txt', 'more txt'])

# Convert, transpose, rename, then concatenate.
st = s.to_frame().transpose()
st.columns = df.columns
out = pd.concat([df, st], axis=0, ignore_index=True)
In the case of one dataframe and one series, you can do:
df.loc[df.shape[0], :] = s.values
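For the two-dataframe case, a sketch of the rename-by-position route (assuming both frames have the same number of columns):
# Overwrite the second frame's labels positionally, then concatenate.
df2_aligned = df2.copy()
df2_aligned.columns = df1.columns
combined = pd.concat([df1, df2_aligned], ignore_index=True)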

pandas combining 2 dataframes with different date indices

Let's say I've pulled CSV data from two separate files, each containing a date index that pandas automatically parsed from one of the original columns.
import pandas as pd
df1 = pd.io.parsers.read_csv(data1, parse_dates=True, infer_datetime_format=True, index_col=0, names=['A'])
df2 = pd.io.parsers.read_csv(data2, parse_dates=True, infer_datetime_format=True, index_col=0, names=['A'])
Now the dates for one csv file are different than the other, but when loaded with read_csv, the dates are well defined. I've tried the join command, but it doesn't seem to preserve the dates.
df1 = df1.join(df2)
I get a valid data frame, but the range of the dates is fixed to some smaller subset of what the original range should be, given the disparity between the dates in the two CSV files. What I would like is a single dataframe with two columns (both 'A' columns) where NaN or zero values are filled in automatically for the non-overlapping dates. Is there a simple solution for this, or is there something I might be missing here? Thanks so much.
By default, pandas' DataFrame.join method combines two dataframes with a 'left' join, which keeps only df1's index. You want an 'outer' join. Your join line should read:
df1 = df1.join(df2, how='outer', lsuffix='_1', rsuffix='_2')
(The suffixes are needed because both frames have a column named 'A'; without them, join raises a 'columns overlap' error.)
See http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.join.html
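A minimal sketch of the behavior, with two hypothetical date-indexed frames:
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 2.0]},
                   index=pd.to_datetime(['2020-01-01', '2020-01-02']))
df2 = pd.DataFrame({'A': [3.0, 4.0]},
                   index=pd.to_datetime(['2020-01-02', '2020-01-03']))

# The outer join keeps the union of both date indices; non-overlapping
# dates get NaN, which fillna(0) can replace with zeros if desired.
out = df1.join(df2, how='outer', lsuffix='_1', rsuffix='_2')
print(out)
#             A_1  A_2
# 2020-01-01  1.0  NaN
# 2020-01-02  2.0  3.0
# 2020-01-03  NaN  4.0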
