PYTHON/PANDAS - Reindexing on multiple indexes

I have a dataframe similar to what follows:
test = {"id": ["A", "A", "A", "B", "B", "B"],
"date": ["09-02-2013", "09-03-2013", "09-05-2013", "09-15-2013", "09-17-2013", "09-18-2013"],
"country": ["Poland", "Poland", "France", "Scotland", "Scotland", "Canada"]}
and I want a table which returns this :
id  date        country
A   09-02-2013  Poland
A   09-03-2013  Poland
A   09-04-2013  Poland
A   09-05-2013  France
B   09-15-2013  Scotland
B   09-16-2013  Scotland
B   09-17-2013  Scotland
B   09-18-2013  Canada
i.e. a table that fills in any dates I am missing, but only between the min and max date of each id.
I have looked around Stack Overflow, but usually this problem involves just one index, or the person wants to drop an index anyway.
This is what I have so far:
test_df = pd.DataFrame(test)
# get min date per id
dates = test_df.groupby("id")["date"].min().to_frame(name="min")
# get max date per id
dates["max"] = test_df.groupby("id")["date"].max()
# build the full (date, id) index from each id's date range
midx = pd.MultiIndex.from_frame(
    dates.apply(lambda x: pd.date_range(x["min"], x["max"], freq="D"), axis=1)
         .explode()
         .reset_index(name="date")[["date", "id"]]
)
test_df = test_df.set_index(["date", "id"])
test_df = test_df.reindex(midx).fillna(method="ffill")
test_df
Which gets me really close but not quite there, with the dates all there but no country:
id  date        country
A   09-02-2013  NaN
A   09-03-2013  NaN
A   09-04-2013  NaN
A   09-05-2013  NaN
B   09-15-2013  NaN
B   09-16-2013  NaN
B   09-17-2013  NaN
B   09-18-2013  NaN
Any ideas on how to fix it?

IIUC, you could generate a date_range per group, explode, then merge and ffill the values per group:
out = (test_df
       .merge(pd.to_datetime(test_df['date'], dayfirst=False)
                .groupby(test_df['id'])
                .apply(lambda g: pd.date_range(g.min(), g.max(), freq='D'))
                .explode()
                .astype('datetime64[ns]')  # explode yields object dtype; restore datetimes for .dt
                .dt.strftime('%m-%d-%Y')
                .reset_index(name='date'),
              how='right')
       .assign(country=lambda d: d.groupby('id')['country'].ffill())
       )
output:
id date country
0 A 09-02-2013 Poland
1 A 09-03-2013 Poland
2 A 09-04-2013 Poland
3 A 09-05-2013 France
4 B 09-15-2013 Scotland
5 B 09-16-2013 Scotland
6 B 09-17-2013 Scotland
7 B 09-18-2013 Canada
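For comparison, a more compact per-group sketch (not from the answer above; note it returns real datetimes rather than formatted strings): convert the date column upfront, then let Series.asfreq insert the missing days between each id's min and max before forward-filling:
import pandas as pd

test_df = pd.DataFrame(test)
test_df['date'] = pd.to_datetime(test_df['date'], format='%m-%d-%Y')

out = (test_df.set_index('date')
              .groupby('id')['country']
              .apply(lambda s: s.asfreq('D').ffill())  # daily range per id, forward-fill country
              .reset_index())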

Related

Python Dataframe: pivot rows as columns

I have raw files from different stations. When I combine them into a dataframe, I see three columns with matching id and name but a different component. I want to convert this into a dataframe where the name entries become the column names.
Code:
df =
id name component
0 1 Serial Number 103
1 2 Station Name DC
2 1 Serial Number 114
3 2 Station Name CA
4 1 Serial Number 147
5 2 Station Name FL
Expected answer:
new_df =
Station Name Serial Number
0 DC 103
1 CA 114
2 FL 147
My answer:
# Solution1
df.pivot_table('id','name','component')
name
NaN NaN NaN NaN
# Solution2
df.pivot(index=None,columns='name')['component']
name
NaN NaN NaN NaN
I am not getting the desired answer. Any help?
First you have to give every two rows the same id; after that you can use pivot.
import pandas as pd

df = pd.DataFrame({'id': ["1", "2", "1", "2", "1", "2"],
                   'name': ["Serial Number", "Station Name", "Serial Number", "Station Name", "Serial Number", "Station Name"],
                   'component': ["103", "DC", "114", "CA", "147", "FL"]})
new_column = [x // 2 + 1 for x in range(len(df))]
df["id"] = new_column
df = df.pivot(index='id', columns='name')['component']
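Running this end-to-end should give something like the following (output reconstructed by hand, not copied from a run):
name  Serial Number Station Name
id
1               103           DC
2               114           CA
3               147           FL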
If each Serial Number row comes just before its Station Name row, you can pivot on the name column, then combine every two rows:
df_ = df.pivot(columns='name', values='component').groupby(df.index // 2).first()
print(df_)
name Serial Number Station Name
0 103 DC
1 114 CA
2 147 FL

Fill missing values in a dataframe based on other cell values

I have a large list of names and I am trying to cull the duplicates. I am grouping them by name and consolidating the info where needed.
When two people don't have the same name it is no problem: we can just ffill and bfill. However, if two people have the same name, we need to do some extra checks.
This is an example of a group:
name code id country yob
1137 Bobby Joe USA19921111 NaN NaN NaN
2367 Bobby Joe NaN 1223133121 USA 1992
4398 Bobby Joe USA19981111 NaN NaN NaN
The code contains the person's country and birthdate. Looking at it, we can see that the first and second rows are the same person, so we need to fill the info from the second row into the first row:
name code id country yob
1137 Bobby Joe USA19921111 1223133121 USA 1992
4398 Bobby Joe USA19981111 NaN NaN NaN
Here is what I have:
# Create a dictionary of all of the rows that contain
# codes, keyed by index
code_rows = dict(zip(list(group['code'].dropna().index),
                     group['code'].dropna().unique()))
no_code_rows = group.loc[pd.isnull(group['code']), :]
if no_code_rows.empty or len(code_rows) == group.shape[0]:
    # No info to consolidate
    return group
for group_idx, code in code_rows.items():
    for row_idx, row in no_code_rows.iterrows():
        country_yob = row['country'] + str(int(row['yob']))
        if country_yob in code:
            group.loc[group_idx, 'id'] = row['id']
            group.loc[group_idx, 'country'] = row['country']
            group.loc[group_idx, 'yob'] = row['yob']
            group.drop(row_idx, inplace=True)
            # Drop from the temp table too so we don't iterate
            # over an extra row
            no_code_rows.drop(row_idx, inplace=True)
            break
return group
This works, but I have a feeling I am missing something. I feel like I shouldn't have to use two loops for this, and that maybe there is a pandas function?
EDIT
We don't know the order or how many rows we will have in each group
i.e.
name code id country yob
1137 Bobby Joe USA19921111 NaN NaN NaN
2367 Bobby Joe USA19981111 NaN NaN NaN
4398 Bobby Joe NaN 1223133121 USA 1992
I think you need to self-merge the coded and uncoded rows on name, then copy the values across wherever country + yob appears in the code:
m = df['code'].isnull()
df1 = df[~m]
df2 = df[m]

df = df1.merge(df2, on='name', suffixes=('', '_'))
df['a_'] = df['country_'] + df['yob_'].astype(str)
m = df.apply(lambda x: x['a_'] in x['code'], axis=1)

df.loc[m, ['id', 'country', 'yob']] = (
    df.loc[m, ['id_', 'country_', 'yob_']].rename(columns=lambda x: x.strip('_')))
df = df.loc[:, ~df.columns.str.endswith('_')]
print(df)
name code id country yob
0 Bobby Joe USA19921111 1223133121 USA 1992
1 Bobby Joe USA19981111 NaN NaN NaN
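For anyone reproducing this, a sketch of the example frame (the dtypes are an assumption; yob is kept as a nullable integer so that .astype(str) gives '1992' rather than '1992.0', which the substring check above relies on):
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['Bobby Joe'] * 3,
                   'code': ['USA19921111', np.nan, 'USA19981111'],
                   'id': [np.nan, '1223133121', np.nan],
                   'country': [np.nan, 'USA', np.nan],
                   'yob': pd.array([pd.NA, 1992, pd.NA], dtype='Int64')},  # nullable int, not float
                  index=[1137, 2367, 4398])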

Merging Two Dataframes without a Key Column

I have a requirement where I want to merge two data frames without any key column.
From the input table, I am treating the first three columns as one data frame and the last column as another. My plan is to sort the second data frame and then merge it into the first without any key column, so that it looks like the above output.
Is it possible to merge in this way, or are there any alternatives?
One way is to use pd.DataFrame.join after filtering out null values.
Data from @ALollz.
import pandas as pd

df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})

res = df1.join(pd.DataFrame(list(filter(None, df2.values)), columns=['comments']))
Result:
Country comments
0 USA X
1 UK Y
2 Finland Z
3 Spain NaN
4 Australia NaN
If by "sort the second dataframe" you mean move the NULL values to the end of the list and keep the rest of the order in tact, then this will get the job done.
import pandas as pd

df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia'],
                    'Name': ['Sam', 'Chris', 'Jeff', 'Kartik', 'Mavenn']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})

df1['Comments'] = df2[df2.Comments.notnull()].reset_index(drop=True)
Country Name Comments
0 USA Sam X
1 UK Chris Y
2 Finland Jeff Z
3 Spain Kartik NaN
4 Australia Mavenn NaN
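The same idea also fits on one line, since column assignment aligns on the reset index (a sketch equivalent to the snippet above):
df1['Comments'] = df2['Comments'].dropna().reset_index(drop=True)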
IIUC:
input['Comments'] = input.Comments.sort_values().values
Output:
Comments Country Name
1 X USA Sam
2 Y UK Chris
3 Z Finland Jeff
4 NaN Spain Kartik
5 NaN Australia Mavenn

Cells all become NaN after reordering alphabetically

After I tried to sort my Pandas dataframe by the country column with:
times_data2.reindex_axis(sorted(times_data2['country']), axis=1)
My dataframe became something like:
Argentina Argentina ... United States of America ...
NaN       NaN       ... NaN                      ...
That call relabels the columns: reindex_axis (long since removed; use reindex) takes the values in country as the new column labels, and since no such columns exist, everything comes back NaN. If you want to set the index of the dataframe to sorted countries:
df = pd.DataFrame({'country': ['Brazil', 'USA', 'Argentina'], 'val': [1, 2, 3]})
>>> df
country val
0 Brazil 1
1 USA 2
2 Argentina 3
>>> df.set_index('country').sort_index()
val
country
Argentina 3
Brazil 1
USA 2
You may want to transpose these results:
>>> df.set_index('country').sort_index().T
country Argentina Brazil USA
val 3 1 2
If you want to sort by a column, use .sort_values():
times_data2.sort_values(by='country')
Then use .set_index('country') if necessary.
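With the toy frame from above (times_data2 itself isn't shown in the question), that looks like:
>>> df.sort_values(by='country')
     country  val
2  Argentina    3
0     Brazil    1
1        USA    2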

Compare pandas DataFrames and return rows that are missing from the first one

I have 2 DataFrames and want to compare them and return rows from the first one (df1) that are not in the second one (df2). I found a way to compare them and return the differences, but can't figure out how to return only the rows missing from df2.
import pandas as pd
from pandas import Series, DataFrame

df1 = pd.DataFrame({"City": ["Chicago", "San Franciso", "Boston"],
                    "State": ["Illinois", "California", "Massachusett"]})
df2 = pd.DataFrame({"City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
                    "State": ["Illinois", "Florida", "Texas", "Nebraska"]})

df = pd.concat([df1, df2])
df = df.reset_index(drop=True)
df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
blah = df.reindex(idx)
Building on @EdChum's suggestion:
df = pd.merge(df1, df2, how='outer', suffixes=('','_y'), indicator=True)
rows_in_df1_not_in_df2 = df[df['_merge']=='left_only'][df1.columns]
rows_in_df1_not_in_df2
   City          State
1  San Franciso  California
2  Boston        Massachusett
EDIT: incorporates @RobertPeters' suggestion
IIUC, if you're using pandas version 0.17.0 or later you can use merge and set indicator=True:
In [80]:
df1 = pd.DataFrame({"City": ["Chicago", "San Franciso", "Boston"],
                    "State": ["Illinois", "California", "Massachusett"]})

df2 = pd.DataFrame({"City": ["Chicago", "Mmmmiami", "Dallas", "Omaha"],
                    "State": ["Illinois", "Florida", "Texas", "Nebraska"]})

pd.merge(df1, df2, how='outer', indicator=True)
Out[80]:
City State _merge
0 Chicago Illinois both
1 San Franciso California left_only
2 Boston Massachusett left_only
3 Mmmmiami Florida right_only
4 Dallas Texas right_only
5 Omaha Nebraska right_only
This adds a _merge column indicating whether each row is present only in the lhs, only in the rhs, or in both.
You can also use a list comprehension and compare the rows to return the missing elements:
dif_list = [x for x in list(df1['City'].unique()) if x not in list(df2['City'].unique())]
returns:
['San Franciso', 'Boston']
You could then get a dataframe with just those rows (note this compares only the City column, not entire rows):
dfdif = df1[df1['City'].isin(dif_list)]
If you're on pandas < 0.17.0, you could work your way up like:
In [182]: df = pd.merge(df1, df2, on='City', how='outer')
In [183]: df
Out[183]:
City State_x State_y
0 Chicago Illinois Illinois
1 San Franciso California NaN
2 Boston Massachusett NaN
3 Mmmmiami NaN Florida
4 Dallas NaN Texas
5 Omaha NaN Nebraska
In [184]: df.ix[df['State_y'].isnull(),:]
Out[184]:
City State_x State_y
1 San Franciso California NaN
2 Boston Massachusett NaN
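Note that .ix has since been removed (as of pandas 1.0); on current versions the same selection is written with .loc:
df.loc[df['State_y'].isnull(), :]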
