Merging Two Dataframes without a Key Column - python

I have a requirement where I want to merge two data frames without any key column.
From the input table, I am treating the first three columns as one data frame and the last column as another. My plan is to sort the second data frame and then merge it to the first one without any key column so that it looks like the desired output.
Is it possible to merge this way, or are there any alternatives?

One way is to use pd.DataFrame.join after filtering out null values.
Data from @ALollz.
import pandas as pd

df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})

# filter(None, ...) drops the null rows; join then aligns on the fresh 0-based index.
res = df1.join(pd.DataFrame(list(filter(None, df2.values)), columns=['comments']))
Result:
     Country comments
0        USA        X
1         UK        Y
2    Finland        Z
3      Spain      NaN
4  Australia      NaN
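A shorter equivalent (a sketch on the same df1/df2; note it keeps the original 'Comments' column name) is to drop the nulls, renumber from zero, and let join align on the default index:
# Drop the nulls, rebuild a 0-based index, and join on it.
res = df1.join(df2.dropna().reset_index(drop=True))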

If by "sort the second dataframe" you mean move the NULL values to the end of the list and keep the rest of the order in tact, then this will get the job done.
import pandas as pd

df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia'],
                    'Name': ['Sam', 'Chris', 'Jeff', 'Kartik', 'Mavenn']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})

# Keep the non-null comments, renumber from 0, and assign by index alignment.
df1['Comments'] = df2.loc[df2.Comments.notnull(), 'Comments'].reset_index(drop=True)
     Country    Name Comments
0        USA     Sam        X
1         UK   Chris        Y
2    Finland    Jeff        Z
3      Spain  Kartik      NaN
4  Australia  Mavenn      NaN

IIUC:
input['Comments'] = input.Comments.sort_values().values
Output:
Comments Country Name
1 X USA Sam
2 Y UK Chris
3 Z Finland Jeff
4 NaN Spain Kartik
5 NaN Australia Mavenn
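This works because Series.sort_values places NaN at the end by default; spelling the parameter out (a sketch, using the same frame the one-liner operates on):
# na_position='last' is already the default; written explicitly for clarity.
input['Comments'] = input.Comments.sort_values(na_position='last').values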


How to implode (reverse of explode) only non-null values in pandas; merge multiple rows into a single row using pandas groupby

I am working on Python Pandas.
I have a pandas dataframe with columns like this:
ID  Cities
1   New York
1   ''
1   Atlanta
2   Tokyo
2   Kyoto
2   ''
3   Paris
3   Bordeaux
3   ''
4   Mumbai
4   ''
4   Bangalore
5   London
5   ''
5   Bermingham
Note the empty cells in the column are either empty strings (''), NaN, or None. (For simplicity, let's just say they are empty strings ('').)
And I want the result to be like this:
ID  Cities
1   New York, Atlanta
2   Tokyo, Kyoto
3   Paris, Bordeaux
4   Mumbai, Bangalore
5   London, Bermingham
In short, I want to group by ID and get the list of cities, with the empty strings removed.
I have sample code for this, but it gives me a result that still contains the empty strings:
(dataFrame.groupby(['ID'], as_index=False)
          .agg({'Cities': lambda x: x.tolist()}))
It gives me a result like this:
ID  Cities
1   New York, , Atlanta
2   Tokyo, Kyoto,
3   Paris, Bordeaux,
4   Mumbai, , Bangalore
5   London, , Bermingham
But I don't want the empty strings.
Please help me here.
Thank you so much for your help.
You can try replacing empty strings with NaN and then adding .dropna() inside the aggregation lambda, as follows:
import numpy as np

df['Cities'] = df['Cities'].replace('', np.nan)
(df.groupby('ID', as_index=False)
   .agg({'Cities': lambda x: x.dropna().tolist()})
)
Result:
ID Cities
0 1 [New York, Atlanta]
1 2 [Tokyo, Kyoto]
2 3 [Paris, Bordeaux]
3 4 [Mumbai, Bangalore]
4 5 [London, Bermingham]
We can also perform the operations at the Series level: mask out the unneeded values such as the empty string (''), dropna to remove the missing values, then groupby and aggregate into whatever type is needed, such as a list:
new_df = (
    df['Cities']
    .mask(df['Cities'].eq(""))  # Replace empty strings with NaN
    .dropna()                   # Exclude NaN
    .groupby(df['ID'])          # Group by ID
    .aggregate(list)            # Join into list
    .reset_index()              # Convert back to DataFrame
)
Or filter out unneeded rows by condition:
new_df = (
    # Filter out rows by condition
    df.loc[df['Cities'].ne("") & df['Cities'].notnull(), 'Cities']
    .groupby(df['ID'])          # Group by ID
    .aggregate(list)            # Join into list
    .reset_index()              # Convert back to DataFrame
)
new_df:
ID Cities
0 1 [New York, Atlanta]
1 2 [Tokyo, Kyoto]
2 3 [Paris, Bordeaux]
3 4 [Mumbai, Bangalore]
4 5 [London, Bermingham]
Setup:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
    'Cities': ['New York', "", 'Atlanta', 'Tokyo', 'Kyoto', "", 'Paris',
               'Bordeaux', "", 'Mumbai', "", 'Bangalore', 'London', "",
               'Bermingham']
})

How to turn header inside rows into columns?

How do I turn the headers inside the rows into columns?
For example, I have the DataFrame below (built by the code in the EDIT) and would like to reshape it into the output shown further down.
EDIT:
Code to produce current df example
import pandas as pd

df = pd.DataFrame({'Date': [2020, 2021, 2022],
                   'James': '', ' Sales': [3, 4, 5], ' City': 'NY', ' DIV': 'a',
                   'KIM': '', ' Sales ': [3, 4, 5], ' City ': 'SF', ' DIV ': 'b'}).T.reset_index()
index 0 1 2
0 Date 2020 2021 2022
1 James
2 Sales 3 4 5
3 City NY NY NY
4 DIV a a a
5 KIM
6 Sales 3 4 5
7 City SF SF SF
8 DIV b b b
looking to get
Name City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 3 4 5
I think the best way is to iterate over the first column: if a name (e.g. James) has no indent, it should be turned into a column, until another un-indented value (KIM) is hit. So I need a way to put the un-indented headers into a new column, stopping whenever a new header (KIM) comes up.
Edit 2: There are not only two names (KIM or JAMES); there are about 20 names. And it is not only the three second levels (Sales, City, DIV); different names have more than 3 second levels, some have 7. The only thing that is consistent is that the names are not indented but the second levels are.
Using a slightly simpler example, this works, but it sure ain't pretty:
import pandas as pd

df = pd.DataFrame({
    'date': ['James', 'Sales', 'City', 'Kim', 'Sales', 'City'],
    '2020': ['', '3', 'NY', '', '4', 'SF'],
    '2021': ['', '4', 'NY', '', '5', 'SF'],
})

def rows_to_columns(group):
    # Turn each attribute row (anything that is neither the person's
    # name nor 'Sales') into its own filled-out column.
    for value in group.date.values:
        if value != group.person.values[0] and value != 'Sales':
            temp_column = '_' + value
            group.loc[group['date'] == value, temp_column] = group['2020']
            group[value.lower()] = (
                group[temp_column]
                .fillna(method='ffill')
                .fillna(method='bfill')
            )
            group.drop([temp_column], axis=1, inplace=True)
    return group

df.loc[df['2020'] == '', 'person'] = df.date
df.person = df.person.fillna(method='ffill')

new_df = (df
          .groupby('person')
          .apply(lambda x: rows_to_columns(x))
          .drop(['date'], axis=1)
          .loc[df.date == 'Sales']
          )
The basic idea is to:
1. Copy the name into a separate column and fill that column using .fillna(method='ffill'). This works if the assumption holds that every person's block begins with the person's name; otherwise it wreaks havoc.
2. Convert all other values, such as 'DIV' and 'City', via rows_to_columns(group). The function iterates over all rows in a group that are neither the person's name nor 'Sales', copies the value from the row into a temp column, creates a new column for that row, and uses ffill and bfill to fill it out. It then deletes the temp column and returns the group.
The resulting data frame is in the intended format once the 'date' column is dropped and only the 'Sales' rows are kept.
Note: This solution probably does not work well on larger datasets.
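For completeness, here is a more vectorized sketch of the same idea, assuming (as in the question) that the name rows are exactly the rows whose year columns are empty and that City/DIV do not change across years:
import pandas as pd

df = pd.DataFrame({
    'date': ['James', 'Sales', 'City', 'Kim', 'Sales', 'City'],
    '2020': ['', '3', 'NY', '', '4', 'SF'],
    '2021': ['', '4', 'NY', '', '5', 'SF'],
})

# Name rows have empty year columns; forward-fill the name downwards.
is_name = df['2020'].eq('')
df['person'] = df['date'].where(is_name).ffill()
body = df[~is_name]

# Attribute rows (City, DIV, ...) pivot into columns; Sales rows keep the years.
attrs = body[body['date'] != 'Sales'].pivot(index='person', columns='date', values='2020')
sales = body[body['date'] == 'Sales'].set_index('person')[['2020', '2021']]
out = attrs.join(sales).reset_index()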
You gave more details, and I see you are not working with multi-level indexes. In that case, the best way would be to create the DataFrame in the format you need from the start. The way you are creating the first DataFrame is not well structured: the names (James/KIM) are columns with empty values, with no link to the other values, and the stacking relies on blank spaces inside strings. Take a look at multi-indexing and generate a DataFrame you can work with, or create the DataFrame in the final format you need.
-- Answer considering multi-level indexes --
Using the little information provided, I see your DataFrame is stacked; that is, you have multiple index levels. The first level is the person (James/KIM) and the second level is Sales/City/DIV. So your DataFrame should be created like this:
import pandas

multi_index = pandas.MultiIndex.from_tuples([
    ('James', 'Sales'), ('James', 'City'), ('James', 'DIV'),
    ('KIM', 'Sales'), ('KIM', 'City'), ('KIM', 'DIV')])
year_2020 = pandas.Series([3, 'NY', 'a', 4, 'SF', 'b'], index=multi_index)
year_2021 = pandas.Series([4, 'NY', 'a', 5, 'SF', 'b'], index=multi_index)
year_2022 = pandas.Series([5, 'NY', 'a', 6, 'SF', 'b'], index=multi_index)
frame = {'2020': year_2020, '2021': year_2021, '2022': year_2022}
df = pandas.DataFrame(frame)
print(df)
            2020 2021 2022
James Sales    3    4    5
      City    NY   NY   NY
      DIV      a    a    a
KIM   Sales    4    5    6
      City    SF   SF   SF
      DIV      b    b    b
Now that you have the multi-level DataFrame, you have many ways to transform it. This is what we will do to make it one level:
sales_df = df.xs('Sales', axis=0, level=1).copy()
div_df = df.xs('DIV', axis=0, level=1).copy()
city_df = df.xs('City', axis=0, level=1).copy()
The results will be:
print(sales_df)
2020 2021 2022
James 3 4 5
KIM 4 5 6
print(div_df)
2020 2021 2022
James a a a
KIM b b b
print(city_df)
2020 2021 2022
James NY NY NY
KIM SF SF SF
You are discarding any information about DIV or City changes across years, so we can reduce the City and DIV dataframes to a Series each, taking the first column as reference:
div_series = div_df.iloc[:,0]
city_series = city_df.iloc[:,0]
Take the sales DF as reference, and add the City and DIV series:
sales_df['DIV'] = div_series
sales_df['City'] = city_series
sales_df['Account'] = 'Sales'
Now reorder the columns as you wish:
sales_df = sales_df[['City', 'DIV', 'Account', '2020', '2021', '2022']]
print(sales_df)
City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 4 5 6
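The same reshaping can be written more compactly with unstack, which moves the second index level into the columns (a sketch on the df built above; City and DIV are again taken from 2020):
wide = df.unstack(level=1)                           # columns become (year, attribute) pairs
sales_df = wide.xs('Sales', axis=1, level=1).copy()  # Sales values per year
sales_df['City'] = wide[('2020', 'City')]
sales_df['DIV'] = wide[('2020', 'DIV')]
sales_df['Account'] = 'Sales'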

Categorize column according to lists and aggregate with result

Let's say I have a dataframe as follows:
import pandas as pd

d = {'name': ['spain', 'greece', 'belgium', 'germany', 'italy'],
     'davalue': [3, 4, 6, 9, 3]}
df = pd.DataFrame(data=d)
index name davalue
0 spain 3
1 greece 4
2 belgium 6
3 germany 9
4 italy 3
I would like to aggregate and sum based on a list of strings in the name column. So for example, I may have: southern=['spain', 'greece', 'italy'] and northern=['belgium','germany'].
My goal is to aggregate by using sum, and obtain:
index name davalue
0 southern 10
1 northern 15
where 10=3+4+3 and 15=6+9
I imagined something like:
df.groupby(by=[['spain','greece','italy'],['belgium','germany']])
could exist. The docs say
A label or list of labels may be passed to group by the columns in self
but I'm not sure I understand what that means in terms of syntax.
I would build a dictionary and map:
southern = ['spain', 'greece', 'italy']
northern = ['belgium', 'germany']

d = {v: 'southern' for v in southern}
d.update({v: 'northern' for v in northern})
df['davalue'].groupby(df['name'].map(d)).sum()
Output:
name
northern 15
southern 10
Name: davalue, dtype: int64
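Countries missing from both lists map to NaN and are silently dropped by the groupby; if you want to keep them under a catch-all label, add a fillna (a sketch):
# Unmapped names become NaN; fillna keeps them in the result.
df['davalue'].groupby(df['name'].map(d).fillna('others')).sum()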
One way could be using np.select and using the result as a grouper:
import numpy as np

southern = ['spain', 'greece', 'italy']
northern = ['belgium', 'germany']

g = np.select([df.name.isin(southern),
               df.name.isin(northern)],
              ['southern', 'northern'],
              'others')
df.groupby(g).sum()
davalue
northern 15
southern 10
df["regional_group"]=df.apply(lambda x: "north" if x["home_team_name"] in ['belgium','germany'] else "south",axis=1)
You create a new column by which you later groubpy.
df.groupby("regional_group")["davavalue"].sum()

Sorting Pandas dataframe with variable columns

I have an arbitrary number of data frames (3 in this case). I am trying to pick out the trip with the highest speed between the starting destination (column A) and the final destination (the last column, which varies by frame). These trips need to be stored in a new dataframe.
import pandas as pd

d = {'A': ['London', 'London', 'London', 'London', 'Budapest'],
     'B': ['Beijing', 'Sydney', 'Warsaw', 'Budapest', 'Warsaw'],
     'Speed': [1000, 2000, 500, 499, 500]}
df = pd.DataFrame(data=d)

d1 = {'A': ['London', 'London', 'London', 'Budapest'],
      'B': ['Rio', 'Rio', 'Rio', 'Rio'],
      'C': ['Beijing', 'Sydney', 'Budapest', 'Warsaw'],
      'Speed': [2000, 1000, 500, 500]}
df1 = pd.DataFrame(data=d1)

d2 = {'A': ['London', 'London', 'London', 'London'],
      'B': ['Florence', 'Florence', 'Florence', 'Florence'],
      'C': ['Rio', 'Rio', 'Rio', 'Rio'],
      'D': ['Beijing', 'Sydney', 'Oslo', 'Warsaw'],
      'Speed': [500, 500, 500, 1000]}
df2 = pd.DataFrame(data=d2)
The desired output for this particular case would look like this:
A B C D Speed
London Rio Beijing NaN 2000
London Sydney NaN NaN 2000
London Florence Rio Warsaw 1000
London Florence Rio Oslo 500
London Rio Budapest NaN 500
Budapest Warsaw NaN NaN 500
I started by appending the dataframes with:
df.append(df1).append(df2)
First concatenate all DataFrames together and sort by column Speed. Then filter with a boolean mask, using ffill to forward-fill the missing values before checking duplicated:
df = pd.concat([df, df1, df2]).sort_values('Speed', ascending=False)
df = df[~df.ffill(axis=1).duplicated(['A','D'])].reset_index(drop=True)
print (df)
A B C D Speed
0 London Sydney NaN NaN 2000
1 London Rio Beijing NaN 2000
2 London Florence Rio Warsaw 1000
3 Budapest Warsaw NaN NaN 500
4 London Rio Budapest NaN 500
5 London Florence Rio Oslo 500
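To see why the mask works: ffill(axis=1) copies each row's last destination into column D, so duplicated(['A', 'D']) flags every later (i.e. slower, after the sort) trip with the same start and end. A minimal illustration (a sketch):
row = pd.DataFrame({'A': ['London'], 'B': ['Sydney'], 'C': [None], 'D': [None]})
print(row.ffill(axis=1))
#         A       B       C       D
# 0  London  Sydney  Sydney  Sydney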
You can sort the data frames by values or by index. For example, if you want to sort by column B, you can write the code below.
For a single column:
df.sort_values(by=['B'])
For multiple columns:
df.sort_values(by=['col1', 'col2'])
You can also sort by the index values.
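For example (a sketch; df is any of the frames above):
# Sort rows by the index labels rather than by column values.
df.sort_index()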

All cells become NaN after reordering alphabetically

After I tried to sort my Pandas dataframe by the country column with:
times_data2.reindex_axis(sorted(times_data2['country']), axis=1)
My dataframe became something like:
Argetina Argentina .... United States of America ...
NaN NaN .... NaN ....
That happens because reindex_axis treats the sorted country values as new column labels, none of which exist in the frame. If you want to set the index of the dataframe to the sorted countries:
df = pd.DataFrame({'country': ['Brazil', 'USA', 'Argentina'], 'val': [1, 2, 3]})
>>> df
country val
0 Brazil 1
1 USA 2
2 Argentina 3
>>> df.set_index('country').sort_index()
val
country
Argentina 3
Brazil 1
USA 2
You may want to transpose these results:
>>> df.set_index('country').sort_index().T
country Argentina Brazil USA
val 3 1 2
If you want to sort by a column, use .sort_values():
times_data2.sort_values(by='country')
Then use .set_index('country') if necessary.
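Putting both steps together on the toy frame above (a sketch; substitute your times_data2):
sorted_df = df.sort_values(by='country').set_index('country')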
