I am using a pandas DataFrame and I would like to shift the values of one column by one row relative to the other columns, so that the resulting DataFrame is one row shorter.
The new DataFrame should use the values from id 2-5, but be re-indexed to 1-4 after the manipulation. There are more columns than just name and place.
How can I quickly manipulate the DataFrame like this?
Thank you very much.
You can shift the name column and then take a slice using iloc:
In [55]:
df = pd.DataFrame({'id':np.arange(1,6), 'name':['john', 'bla', 'tim','walter','john'], 'place':['new york','miami','paris','rome','sydney']})
df
Out[55]:
id name place
0 1 john new york
1 2 bla miami
2 3 tim paris
3 4 walter rome
4 5 john sydney
In [56]:
df['name'] = df['name'].shift(-1)
df = df.iloc[:-1]
df
Out[56]:
id name place
0 1 bla new york
1 2 tim miami
2 3 walter paris
3 4 john rome
If your 'id' column is your index the above still works:
In [62]:
df = pd.DataFrame({'name':['john', 'bla', 'tim','walter','john'], 'place':['new york','miami','paris','rome','sydney']},index=np.arange(1,6))
df.index.name = 'id'
df
Out[62]:
name place
id
1 john new york
2 bla miami
3 tim paris
4 walter rome
5 john sydney
In [63]:
df['name'] = df['name'].shift(-1)
df = df.iloc[:-1]
df
Out[63]:
name place
id
1 bla new york
2 tim miami
3 walter paris
4 john rome
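A minimal, self-contained sketch of the same shift-and-slice step, wrapped in a helper. The function name shift_column_up is hypothetical (not part of pandas), and the sketch assumes the id is an ordinary column that should stay untouched, so the result keeps ids 1-4.

```python
import numpy as np
import pandas as pd


def shift_column_up(frame, column):
    """Return a copy with `column` moved up one row and the last row dropped."""
    out = frame.copy()                   # leave the caller's frame unchanged
    out[column] = out[column].shift(-1)  # the last row of `column` becomes NaN
    return out.iloc[:-1]                 # drop that trailing row


df = pd.DataFrame({
    'id': np.arange(1, 6),
    'name': ['john', 'bla', 'tim', 'walter', 'john'],
    'place': ['new york', 'miami', 'paris', 'rome', 'sydney'],
})
result = shift_column_up(df, 'name')
print(result)
```

Working on a copy avoids the SettingWithCopyWarning that can appear when a sliced frame is modified later.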
In python, I have a df that looks like this
Name ID
Anna 1
Polly 1
Sarah 2
Max 3
Kate 3
Ally 3
Steve 3
And a df that looks like this
Name ID
Dan 1
Hallie 2
Cam 2
Lacy 2
Ryan 3
Colt 4
Tia 4
How can I merge the DataFrames so that the ID column looks like this?
Name ID
Anna 1
Polly 1
Sarah 2
Max 3
Kate 3
Ally 3
Steve 3
Dan 4
Hallie 5
Cam 5
Lacy 5
Ryan 6
Colt 7
Tia 7
This is just a minimal reproducible example. My actual data set has thousands of values. I'm basically merging data frames and want the IDs in numerical order (continuing from the previous data frame) instead of restarting from one each time. I know that I can reset the index if ID is a unique identifier. But in this case, more than one person can have the same ID. So how can I account for that?
From your example, the final DataFrame can be obtained by adding the maximum ID of the first df to every ID in the second df and then concatenating the two. To illustrate:
Name df2 final_df
Dan 1 4
The value in final_df is 1 + 3, where 3 is the maximum ID in the first df. The same offset is applied to every row of the second df.
Code:
import pandas as pd
df = pd.DataFrame({'Name':['Anna','Polly','Sarah','Max','Kate','Ally','Steve'],'ID':[1,1,2,3,3,3,3]})
df1 = pd.DataFrame({'Name':['Dan','Hallie','Cam','Lacy','Ryan','Colt','Tia'],'ID':[1,2,2,2,3,4,4]})
max_df = df['ID'].max()  # largest ID in the first frame
df1['ID'] = df1['ID'] + max_df  # vectorised offset, no apply needed
final_df = pd.concat([df, df1], ignore_index=True)  # continuous row index
print(final_df)
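The same idea can be sketched for any number of frames. The helper name concat_with_running_ids is hypothetical, and the sketch assumes every frame is non-empty and that its IDs start at 1.

```python
import pandas as pd


def concat_with_running_ids(frames, id_col='ID'):
    """Concatenate frames, offsetting each frame's IDs by the running maximum."""
    shifted = []
    offset = 0
    for frame in frames:
        part = frame.copy()                   # do not modify the inputs
        part[id_col] = part[id_col] + offset  # continue from the previous max
        offset = part[id_col].max()           # new running maximum
        shifted.append(part)
    return pd.concat(shifted, ignore_index=True)


df = pd.DataFrame({'Name': ['Anna', 'Polly', 'Sarah', 'Max', 'Kate', 'Ally', 'Steve'],
                   'ID': [1, 1, 2, 3, 3, 3, 3]})
df1 = pd.DataFrame({'Name': ['Dan', 'Hallie', 'Cam', 'Lacy', 'Ryan', 'Colt', 'Tia'],
                    'ID': [1, 2, 2, 2, 3, 4, 4]})
final_df = concat_with_running_ids([df, df1])
print(final_df)
```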
I have a database of name records that I want to turn into bigrams, with each bigram becoming a new row in the dataframe. I'm doing this because some records contain multiple names, and the same name can also appear in different orders. My ultimate goal is to look for duplicates and create one final record for each unique individual. I plan to use TF-IDF and cosine similarity on the results of this. Below is an example of what I'm trying to do.
Current:
Goal:
Try using zip, apply and explode:
df.Name = df.Name.str.split()
df.Name.apply(lambda x: tuple(zip(x,x[1:]))).explode().map(lambda x: f"{x[0]} {x[1]}")
Or use a list comprehension:
df2 = pd.Series([ f"{a} {b}" for val in df.Name for (a,b) in (zip(val,val[1:]))])
0 John Doe
1 John Doe
1 Doe Mike
1 Mike Smith
2 John Doe
2 Doe Mike
2 Mike Smith
2 Smith Steve
2 Steve Johnson
3 Smith Mike
3 Mike J.
3 J. Doe
3 Doe Johnson
3 Johnson Steve
4 Steve J.
4 J. M
4 M Smith
Name: Name, dtype: object
edit:
2nd part:
df2 = pd.DataFrame([ [idx+1, f"{a} {b}"] for idx,val in enumerate(df.Name) for (a,b) in (zip(val,val[1:]))], columns=['ID', 'Names'])
bigrams = [[id, ' '.join(b)] for id, l in zip(df['ID'].tolist(), df['Name'].tolist()) for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
bigrams_df = pd.DataFrame(bigrams, columns = ['ID','Name'])
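For reference, a self-contained sketch that keeps the ID next to each bigram. The sample rows are illustrative only, since the original input table is not shown here; the column names ID and Name follow the answers above.

```python
import pandas as pd

# Illustrative data only; the original table is not available.
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['John Doe', 'John Doe Mike Smith', 'Steve J. M Smith'],
})

tokens = df['Name'].str.split()  # one list of words per record
# Pair each word with its successor, then put every pair on its own row.
pairs = tokens.apply(lambda words: [' '.join(p) for p in zip(words, words[1:])])
bigrams = df.assign(Name=pairs).explode('Name').reset_index(drop=True)
print(bigrams)
```

A record with a single word produces an empty list, which explode turns into a NaN row, so such rows may need to be dropped or handled separately.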
I have a simple dataframe with 2 columns, and I want every value in one column to be paired with every value in the other (listed side by side). For example:
It seems there should be a simple way, but I can't recall it. I tried 'explode' and 'melt', but they did not work.
import pandas as pd
data = {'name': ["David","Mike","Kate"],
'info' : ["Department","Titile","Gender"]}
df = pd.DataFrame(data)
df = df.explode('name')
print (df)
What further things can I try?
Use numpy repeat and numpy tile to build your series, then recombine them using pandas concat:
name = pd.Series(np.repeat(df.name.array, len(df)), name="name")
info = pd.Series(np.tile(df["info"].array, len(df)), name="info")
result = pd.concat([name, info], axis="columns")
result
name info
0 David Department
1 David Titile
2 David Gender
3 Mike Department
4 Mike Titile
5 Mike Gender
6 Kate Department
7 Kate Titile
8 Kate Gender
You can do a cross-join on two columns using df.merge by creating a tmp column:
In [429]: df1 = pd.DataFrame(df.name, columns=['name'])
In [430]: df2 = pd.DataFrame(df['info'], columns=['info'])
In [433]: df1['tmp'] = 1
In [435]: df2['tmp'] = 1
In [438]: res = pd.merge(df1, df2, on=['tmp']).drop(columns='tmp')
In [439]: res
Out[439]:
name info
0 David Department
1 David Titile
2 David Gender
3 Mike Department
4 Mike Titile
5 Mike Gender
6 Kate Department
7 Kate Titile
8 Kate Gender
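On pandas 1.2 or newer, the temporary key column is not needed, because merge accepts how='cross'. A minimal sketch using the question's data (including its "Titile" spelling):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['David', 'Mike', 'Kate'],
    'info': ['Department', 'Titile', 'Gender'],
})

# how='cross' pairs every left row with every right row (pandas >= 1.2).
res = df[['name']].merge(df[['info']], how='cross')
print(res)
```

The left-hand order is preserved, so each name appears in a run of three rows, one per info value.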
You can build two lists by repeating each column as many times as there are rows. One of them should be sorted by index, so that each of its values appears in a consecutive run. Then combine the lists into a dataframe.
c1 = [*pd.concat([df['name']]*df.shape[0])]
c2 = [*pd.concat([df['info']]*df.shape[0]).sort_index()]
df = pd.DataFrame(list(zip(c1,c2)), columns=['name', 'info']).sort_values('name').reset_index(drop=True)
df
Out[1]:
name info
0 David Department
1 David Titile
2 David Gender
3 Kate Department
4 Kate Titile
5 Kate Gender
6 Mike Department
7 Mike Titile
8 Mike Gender
Sammy gave me an idea to make the above code slightly more concise (c2 still needs sort_index so that every name is paired with every info value):
c1 = pd.concat([df['name']]*df.shape[0], ignore_index=True)
c2 = pd.concat([df['info']]*df.shape[0]).sort_index().reset_index(drop=True)
df = pd.concat([c1, c2], axis=1)
df
Or, approached from a plain-list point of view, using only basic Python:
import pandas as pd
data = {'name': ["David","Mike","Kate"],
'info' : ["Department","Titile","Gender"]}
df = pd.DataFrame(data)
names = df['name'].tolist()
all_info = df['info'].tolist()
for n in names:
    for a in all_info:
        print(n, a)
Output:
David Department
David Titile
David Gender
Mike Department
Mike Titile
Mike Gender
Kate Department
Kate Titile
Kate Gender
I want to group by ID and get the three most frequent cities. For example, I have this original dataframe:
ID City
1 London
1 London
1 New York
1 London
1 New York
1 Berlin
2 Shanghai
2 Shanghai
and the result I want looks like this:
ID first_frequent_city second_frequent_city third_frequent_city
1 London New York Berlin
2 Shanghai NaN NaN
First use SeriesGroupBy.value_counts to count the values of City per ID; its advantage is that the values are already sorted. Then get a counter with GroupBy.cumcount, keep the first 3 values with loc, pivot with DataFrame.pivot, change the column names, and finally convert ID to a column with DataFrame.reset_index:
df = (df.groupby('ID')['City'].value_counts()
.groupby(level=0).cumcount()
.loc[lambda x: x < 3]
.reset_index(name='c')
.pivot(index='ID', columns='c', values='City')
.rename(columns={0:'first_', 1:'second_', 2:'third_'})
.add_suffix('frequent_city')
.rename_axis(None, axis=1)
.reset_index())
print (df)
ID first_frequent_city second_frequent_city third_frequent_city
0 1 London New York Berlin
1 2 Shanghai NaN NaN
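A self-contained variant of the same count, rank and pivot idea, written as a sketch with explicit intermediate steps. The helper column names n and rank are arbitrary choices, and ties are broken alphabetically by City here, which is an assumption rather than something the question specifies.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1, 2, 2],
    'City': ['London', 'London', 'New York', 'London', 'New York', 'Berlin',
             'Shanghai', 'Shanghai'],
})

# Count each (ID, City) pair, then order by descending count within each ID.
counts = df.groupby(['ID', 'City']).size().reset_index(name='n')
counts = counts.sort_values(['ID', 'n', 'City'], ascending=[True, False, True])
counts['rank'] = counts.groupby('ID').cumcount()  # 0 = most frequent

top3 = (
    counts[counts['rank'] < 3]
    .pivot(index='ID', columns='rank', values='City')
    .reindex(columns=[0, 1, 2])  # keep three columns even if no ID has 3 cities
    .set_axis(['first_frequent_city', 'second_frequent_city',
               'third_frequent_city'], axis=1)
    .reset_index()
)
print(top3)
```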
Another way uses the count as a reference for sorting, then recreates the dataframe by looping through the groupby object:
df = (df.assign(count=df.groupby(["ID","City"])["City"].transform("count"))
.drop_duplicates(["ID","City"])
.sort_values(["ID","count"], ascending=False))
print (pd.DataFrame([i["City"].unique()[:3] for _, i in df.groupby("ID")]).fillna(np.nan))
0 1 2
0 London New York Berlin
1 Shanghai NaN NaN
A bit long; essentially you group by twice. The first part takes the last three rows of each group and drops duplicates, so it relies on the row order of the sample data rather than on actual counts. The second part splits the data into individual columns:
(df
.groupby("ID")
.tail(3)
.drop_duplicates()
.groupby("ID")
.agg(",".join)
.City.str.split(",", expand=True)
.set_axis(["first_frequent_city",
"second_frequent_city",
"third_frequent_city"],
axis="columns",)
)
first_frequent_city second_frequent_city third_frequent_city
ID
1 London New York Berlin
2 Shanghai None None
Get the count per ID and City, then use np.where() with groupby().transform() to label the rows whose count equals the group's max, median or min. Then set the index and unstack the rows into columns on the max column.
df = df.assign(count=df.groupby(['ID', 'City'])['City'].transform('count')).drop_duplicates()
df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('min')), 'third_frequent_city', np.nan)
df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('median')), 'second_frequent_city', df['max'])
df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('max')), 'first_frequent_city', df['max'])
df = df.drop('count',axis=1).set_index(['ID', 'max']).unstack(1)
output:
City
max first_frequent_city second_frequent_city third_frequent_city
ID
1 London New York Berlin
2 Shanghai NaN NaN
I want to make all column headers in my pandas data frame lower case
Example
If I have:
data =
country country isocode year XRAT tcgdp
0 Canada CAN 2001 1.54876 924909.44207
1 Canada CAN 2002 1.56932 957299.91586
2 Canada CAN 2003 1.40105 1016902.00180
....
I would like to change XRAT to xrat by doing something like:
data.headers.lowercase()
So that I get:
country country isocode year xrat tcgdp
0 Canada CAN 2001 1.54876 924909.44207
1 Canada CAN 2002 1.56932 957299.91586
2 Canada CAN 2003 1.40105 1016902.00180
3 Canada CAN 2004 1.30102 1096000.35500
....
I will not know the names of each column header ahead of time.
You can do it like this:
data.columns = map(str.lower, data.columns)
or
data.columns = [x.lower() for x in data.columns]
example:
>>> data = pd.DataFrame({'A':range(3), 'B':range(3,0,-1), 'C':list('abc')})
>>> data
A B C
0 0 3 a
1 1 2 b
2 2 1 c
>>> data.columns = map(str.lower, data.columns)
>>> data
a b c
0 0 3 a
1 1 2 b
2 2 1 c
You could do it easily with str.lower for columns:
df.columns = df.columns.str.lower()
Example:
In [63]: df
Out[63]:
country country isocode year XRAT tcgdp
0 Canada CAN 2001 1.54876 9.249094e+05
1 Canada CAN 2002 1.56932 9.572999e+05
2 Canada CAN 2003 1.40105 1.016902e+06
In [64]: df.columns = df.columns.str.lower()
In [65]: df
Out[65]:
country country isocode year xrat tcgdp
0 Canada CAN 2001 1.54876 9.249094e+05
1 Canada CAN 2002 1.56932 9.572999e+05
2 Canada CAN 2003 1.40105 1.016902e+06
If you want to do the rename using a chained method call, you can use
data.rename(columns=str.lower)
If you're not chaining method calls, you can add inplace=True
data.rename(columns=str.lower, inplace=True)
df.columns = df.columns.str.lower()
is the easiest, but it breaks if some headers are not strings (for example, numeric headers).
If you have numeric headers, use this instead:
df.columns = [str(x).lower() for x in df.columns]
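A small sketch showing the str() cast on a frame with a mix of string and numeric headers. The sample column labels are made up for illustration.

```python
import pandas as pd

# Illustrative frame with one numeric header among string headers.
df = pd.DataFrame([[1, 2, 3]], columns=['XRAT', 123, 'Country ISOcode'])

# Cast each label to str first so non-string labels do not break .lower().
lowered = df.rename(columns=lambda label: str(label).lower())
print(list(lowered.columns))
```

Note that the numeric header becomes the string '123' afterwards, so later lookups must use the string form.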
I noticed some of the other answers will fail if a column name is made of digits (e.g. "123"). Try these to handle such cases too.
Option 1: Use df.rename
def rename_col(old_name):
    return str(old_name).lower()
df = df.rename(columns=rename_col)
Option 2 (from this comment):
df.columns = df.columns.astype(str).str.lower()
Another convention based on the official documentation:
frame.rename(mapper=lambda x:x.lower(), axis='columns', inplace=True)
Parameters:
mapper:
Dict-like or function transformations to apply to that axis’ values. Use either mapper and axis to specify the axis to target with mapper, or index and columns.