How to check if a value exists in another dataframe in pandas?

I have a dataframe below that contains a mapping from French to English:
df1
french english
ksjks an
sjk def
ssad sdsd
Another dataframe's columns are in French, so I need to convert them into English using df1:
df2
ksjks sjk ssad
2 4 6
How can we achieve that?
new_cols = []
for col in df2.columns:
    if col in df1['french']:
        # how to get the corresponding english value?
PS: I have just put in random data as a sample.

Option 1
Use map with set_index:
df2.columns = df2.columns.map(df1.set_index('french').english)
print(df2)
Option 2
Use rename with set_index:
df2.rename(columns=df1.set_index('french').english.to_dict())
Both produce:
an def sdsd
0 2 4 6
Order of the columns doesn't matter:
df1 = pd.DataFrame({'french': ['un', 'deux', 'trois'], 'english': ['one', 'two', 'three']})
df2 = pd.DataFrame([[1,2,3]], columns=['trois', 'deux', 'un'])
df2.rename(columns=df1.set_index('french').english.to_dict())
three two one
0 1 2 3
df2.columns.map(df1.set_index('french').english)
# Index(['three', 'two', 'one'], dtype='object')
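A practical difference between the two options: map produces NaN for any column that is missing from the mapping, while rename leaves unmapped columns unchanged. A minimal sketch, assuming df2 had an extra column 'foo' that is not in df1['french']:
mapping = df1.set_index('french').english
df2.columns.map(mapping)               # 'foo' would become NaN
df2.rename(columns=mapping.to_dict())  # 'foo' would be kept as-is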

df_lookup = pd.DataFrame({"French": ["ksjks", "sjk", "ssad"],
                          "english": ["an", "def", "sdsd"]})
df_actual = pd.DataFrame({"ksjks": [2], "sjk": [4], "ssad": [6]})
# look up each column name in the French column and take its English value
df_actual.columns = [df_lookup.loc[df_lookup["French"] == col, "english"].iloc[0]
                     for col in df_actual.columns]

Related

pandas pivot data Cols to rows and rows to cols

I am using Python and pandas and have tried a variety of attempts to pivot the following (switch the rows and columns).
Example:
A is unique
A B C D E... (and so on)
[0] apple 2 22 222
[1] peach 3 33 333
[N] ... and so on
And I would like to see
? ? ? ? ... and so on
A apple peach
B 2 3
C 22 33
D 222 333
E
... and so on
I am OK if the columns are named after the col "A", and if the first column needs a name, let's call it "name":
name apple peach ...
B 2 3
C 22 33
D 222 333
E
... and so on
Think you're wanting transpose here.
df = pd.DataFrame({'A': {0: 'apple', 1: 'peach'}, 'B': {0: 2, 1: 3}, 'C': {0: 22, 1: 33}})
df = df.T
print(df)
0 1
A apple peach
B 2 3
C 22 33
Edit for comment. I would probably reset the index and then use the df.columns to update the column names with a list. You may want to reset the index again at the end as needed.
df.reset_index(inplace=True)
df.columns = ['name', 'apple', 'peach']
df = df.iloc[1:, :]
print(df)
name apple peach
1 B 2 3
2 C 22 33
Try df.transpose(); it should do the trick.
Taking the advice from the other posts, plus a few other tweaks (explained inline), here is what worked for me.
# get the key column that will become the column names,
# and add a name for the existing columns
cols = df['A'].tolist()
cols.append('name')
# transpose
df = df.T
# the transpose takes the column and makes it an index column;
# we need to add it back into the data set (you might want to drop
# the index later to get rid of it altogether)
df['name'] = df.index
# now rebuild the columns and move the new "name" column to the front
df.columns = cols
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
# remove the first row (it was the column we used for the column names)
df = df.iloc[1:, :]
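For what it's worth, a more compact route under the same assumption (column A holds the future column names) is to make A the index before transposing. A minimal sketch:
out = df.set_index('A').T.rename_axis('name').reset_index()
out.columns.name = None  # drop the leftover 'A' label on the columns axis
print(out)
#   name  apple  peach
# 0    B      2      3
# 1    C     22     33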

Performing a pandas groupby on a list containing 2 and 3 character string subsets of a column

Say I have a simple dataframe with the names of people. I perform a groupby on name
import pandas as pd
df = pd.DataFrame({'col1' : [1,2,3,4,5,6,7], 'name': ['George', 'John', 'Tim', 'Joe', 'Issac', 'George', 'Tim'] })
df1 = df.groupby('name')
Question: How can I select a table of specific names from a list that contains string subsets of the names, either 2 or 3 characters long?
E.g. say I have the following list, where both Tim and Geo are the first 3 characters of some entries in the name column and Jo is the first 2 characters of certain entries in the name column.
list = ['Jo', 'Tim', 'Geo']
Attempted: My initial thought was to create new columns in the original dataframe holding the 2 and 3 character prefixes of the name column and then group by those; however, since the 2 and 3 character prefixes differ, grouping by either one alone wouldn't output the correct result.
I'm not sure whether it would be better to use some if condition, such as "if the list entry has length 2, group by the 2-character prefix, else group by the 3-character prefix", and output the result as one dataframe.
df1['name_2char_subset'] = df1['name'].str[0:2]
df1['name_3char_subset'] = df1['name'].str[0:3]
if v in list is len(2):
    df2 = df1.groupby('name_2char_subset')
else:
    df2 = df1.groupby('name_3char_subset')
Desired output: since there are 2 counts of each of Jo, Geo and Tim, the output should group by each case; i.e. for Jo there are both John and Joe, hence a count of 2 in the groupby.
df3 = pd.DataFrame({'name': ['Jo', 'Tim', 'Geo'], 'col1': [2,2,2]})
How can we group by name and output the entries in name that start with the initial characters given in the list?
Any alternative ways of doing this would be helpful, for example extracting the values in the list after the groupby has been performed.
First, don't use list as a variable name, because it shadows a Python built-in. Then use Series.str.extract to test whether each string matches at its start (anchored by ^) and count with Series.value_counts:
L = ['Jo', 'Tim', 'Geo']
pat = '|'.join(r"^{}".format(x) for x in L)
df = (df['name'].str.extract('(' + pat + ')', expand=False)
        .dropna()
        .value_counts()
        .reindex(L, fill_value=0)
        .rename_axis('name')
        .reset_index(name='col1'))
print (df)
name col1
0 Jo 2
1 Tim 2
2 Geo 2
Your solution:
L = ['Jo', 'Tim', 'Geo']
s1 = df['name'].str[:2]
s2 = df['name'].str[:3]
df = (s1.where(s1.isin(L)).fillna(s2.where(s2.isin(L)))
        .dropna()
        .value_counts()
        .reindex(L, fill_value=0)
        .rename_axis('name')
        .reset_index(name='col1'))
print (df)
name col1
0 Jo 2
1 Tim 2
2 Geo 2
A solution from a deleted answer, changed to use Series.str.startswith to test whether each name starts with a string from the list:
L = ['Jo', 'Tim', 'Geo']
df3 = pd.DataFrame({'name': L})
df3['col1'] = df3['name'].apply(lambda x: sum(df['name'].str.startswith(x)))
print (df3)
name col1
0 Jo 2
1 Tim 2
2 Geo 2
EDIT: If you need to group by more columns, use the first or second solution, assign the extracted values back to the column, and aggregate with named aggregation in GroupBy.agg:
df = pd.DataFrame({'age': [1, 2, 3, 4, 5, 6, 7],
                   'name': ['George', 'John', 'Tim', 'Joe', 'Issac', 'George', 'Tim']})
print (df)
L = ['Jo', 'Tim', 'Geo']
pat = '|'.join(r"^{}".format(x) for x in L)
df['name'] = df['name'].str.extract('('+ pat + ')', expand=False)
df = df.groupby('name').agg(sum_age=('age','sum'), col1=('name', 'count'))
print (df)
sum_age col1
name
Geo 7 2
Jo 6 2
Tim 10 2
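If you'd rather avoid regex, here is a minimal non-regex sketch of the same idea, starting from the df defined at the top of the EDIT (before name is overwritten) and assuming at most one prefix in L matches any given name: numpy.select picks the prefix each name starts with, and the groupby then aggregates per prefix.
import numpy as np

conds = [df['name'].str.startswith(x) for x in L]
df['prefix'] = np.select(conds, L, default=None)  # None where no prefix matches
df.groupby('prefix').agg(sum_age=('age', 'sum'), col1=('name', 'count'))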

Dropping the first row of a dataframe when looping through a list of dataframes

I am trying to write a function to loop through a list of dataframes containing tables I pulled from a website using pd.read_html. I want to drop the first row in each dataframe, and I tried the loop below, but it's not working. Does anyone know why?
for df in df_list:
    df.columns = df.iloc[0]
    df.drop(df.index[0])
df_list[0]
Hospital/Location Specialty
0 Hospital/Location Specialty
1 Maimonides Med Ctr-NY Maimonides Med Ctr-NY Medicine-Preliminary Anesthesiology
2 Jacobi Med Ctr/Einstein-NY Pediatrics
3 Jacobi Med Ctr/Einstein-NY Pediatrics
4 Temple Univ Hosp-PA Internal Medicine
You need to assign it back to df, like this:
df = df.drop(df.index[0])
This removes index 0 from the dataframe, so the dataframe now starts at index 1. Let us assign it back inside the loop:
for idx, df in enumerate(df_list):
    df.columns = df.iloc[0]
    df_list[idx] = df.drop(df.index[0])
Why not use a comprehension?
# test data:
df1 = pd.DataFrame({0: ['col1', 'A', 'B'], 1: ['col2', '1', '2']})
df2 = pd.DataFrame({0: ['colA', 'a', 'b'], 1: ['colB', 'hello', 'goodbye']})
dfs = [df1, df2]
renamed = [d.rename(columns=d.iloc[0]).drop(0) for d in dfs]
for df in renamed:
print(df)
# outputs:
col1 col2
1 A 1
2 B 2
colA colB
1 a hello
2 b goodbye
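Since the frames come from pd.read_html, it may also be possible to sidestep the problem entirely by telling the parser to use the first table row as the header. A minimal sketch, with a hypothetical URL:
import pandas as pd

# header=0 makes read_html take the first row of each table as the
# column names, so there is no leftover header row to drop afterwards
df_list = pd.read_html('https://example.com/page-with-tables.html', header=0)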

Pandas mapping all, and a portion, of column value in another column

I am trying to search for values and portions of values from one column to another and return a third value.
Essentially, I have two dataframes: df and df2. The first has a part number in 'col1'. The second has the part number, or portion of it, in 'col1' and the value I want to put in df['col2'] in 'col2'.
import pandas as pd
df = pd.DataFrame({'col1': ['1-1-1', '1-1-2', '1-1-3',
                            '2-1-1', '2-1-2', '2-1-3']})
df2 = pd.DataFrame({'col1': ['1-1-1', '1-1-2', '1-1-3', '2-1'],
                    'col2': ['A', 'B', 'C', 'D']})
Of course this:
df['col1'].isin(df2['col1'])
Only covers the exact matches, not the partial ones:
df['col1'].isin(df2['col1'])
Out[27]:
0 True
1 True
2 True
3 False
4 False
5 False
Name: col1, dtype: bool
I tried:
df[df['col1'].str.contains(df2['col1'])]
but get:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I also tried using a dictionary made from df2, with the same approaches as above, and also mapping it, with no luck.
The results for df I need would look like this:
col1 col2
'1-1-1' 'A'
'1-1-2' 'B'
'1-1-3' 'C'
'2-1-1' 'D'
'2-1-2' 'D'
'2-1-3' 'D'
I can't figure out how to get the 'D' value into 'col2' because df2['col1'] contains '2-1'--only a portion of the part number.
Any help would be greatly appreciated. Thank you in advance.
We can use str.findall:
s = df.col1.str.findall('|'.join(df2.col1.tolist())).str[0].map(df2.set_index('col1').col2)
df['New'] = s
df
col1 New
0 1-1-1 A
1 1-1-2 B
2 1-1-3 C
3 2-1-1 D
4 2-1-2 D
5 2-1-3 D
If your df and df2 have the specific format shown in the sample, another way is a dict map with fillna, mapping the prefix obtained from rsplit:
d = dict(df2[['col1', 'col2']].values)
df['col2'] = df.col1.map(d).fillna(df.col1.str.rsplit('-', 1).str[0].map(d))
Out[1223]:
col1 col2
0 1-1-1 A
1 1-1-2 B
2 1-1-3 C
3 2-1-1 D
4 2-1-2 D
5 2-1-3 D
Otherwise, besides using findall as in Wen's solution, you may also use extract with the dict d from above:
df.col1.str.extract('('+'|'.join(df2.col1)+')')[0].map(d)
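One caveat with both the findall and the extract patterns: regex alternation matches left to right, so a short key such as '2-1' can shadow a longer key such as '2-1-2' if both appear in df2, and part numbers containing regex metacharacters would break the pattern. A minimal hardened sketch under those assumptions:
import re

# escape each key and try longer keys first
keys = sorted(df2.col1, key=len, reverse=True)
pat = '|'.join(map(re.escape, keys))
df['col2'] = df.col1.str.extract('(' + pat + ')', expand=False).map(d)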

move column in pandas dataframe

I have the following dataframe:
a b x y
0 1 2 3 -1
1 2 4 6 -2
2 3 6 9 -3
3 4 8 12 -4
How can I move columns b and x such that they are the last 2 columns in the dataframe? I would like to specify b and x by name, but not the other columns.
You can rearrange columns directly by specifying their order:
df = df[['a', 'y', 'b', 'x']]
In the case of larger dataframes where the column titles are dynamic, you can use a list comprehension to select every column not in your target set and then append the target set to the end.
>>> df[[c for c in df if c not in ['b', 'x']] + ['b', 'x']]
a y b x
0 1 -1 2 3
1 2 -2 4 6
2 3 -3 6 9
3 4 -4 8 12
To make it more bulletproof, you can ensure that your target columns are indeed in the dataframe:
cols_at_end = ['b', 'x']
df = df[[c for c in df if c not in cols_at_end]
        + [c for c in cols_at_end if c in df]]
cols = list(df.columns.values)  # make a list of all of the columns in the df
cols.pop(cols.index('b'))       # remove b from the list
cols.pop(cols.index('x'))       # remove x from the list
df = df[cols + ['b', 'x']]      # create a new dataframe with the columns in the order you want
For example, to move column "name" to be the first column in df you can use insert:
column_to_move = df.pop("name")
# insert column with insert(location, column_name, column_value)
df.insert(0, "name", column_to_move)
Similarly, if you want this column to be, e.g., the third column from the beginning:
df.insert(2, "name", column_to_move)
You can use the approach below. It's very simple, but similar to the good answer given by Charlie Haley.
df1 = df.pop('b')  # remove column b and store it in df1
df2 = df.pop('x')  # remove column x and store it in df2
df['b'] = df1      # add the b series back as a 'new' column
df['x'] = df2      # add the x series back as a 'new' column
Now you have your dataframe with the columns 'b' and 'x' in the end. You can see this video from OSPY : https://youtu.be/RlbO27N3Xg4
Similar to ROBBAT1's answer above, but hopefully a bit more robust. Note that the insert position needs to be len(df.columns) after the pop, so each column lands at the very end:
df.insert(len(df.columns), 'b', df.pop('b'))
df.insert(len(df.columns), 'x', df.pop('x'))
This function will reorder your columns without losing data. Any omitted columns remain in the center of the data set:
def reorder_columns(columns, first_cols=[], last_cols=[], drop_cols=[]):
    # a list comprehension (rather than set subtraction) keeps the
    # relative order of the untouched middle columns
    middle = [c for c in columns
              if c not in set(first_cols) | set(last_cols) | set(drop_cols)]
    return first_cols + middle + last_cols
Example usage:
my_list = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth']
reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
# Output:
['fourth', 'third', 'first', 'sixth', 'second']
To assign to your dataframe, use:
my_list = df.columns.tolist()
reordered_cols = reorder_columns(my_list, first_cols=['fourth', 'third'], last_cols=['second'], drop_cols=['fifth'])
df = df[reordered_cols]
Simple solution:
old_cols = df.columns.values
new_cols= ['a', 'y', 'b', 'x']
df = df.reindex(columns=new_cols)
An alternative, more generic method:
from pandas import DataFrame

def move_columns(df: DataFrame, cols_to_move: list, new_index: int) -> DataFrame:
    """
    Re-arrange the columns in a dataframe to place the desired columns at the desired index.

    ex usage: df = move_columns(df, ['Rev'], 2)
    :param df: the dataframe to re-arrange
    :param cols_to_move: the names of the columns to move, as a list
    :param new_index: the 0-based location to place the columns
    :return: a dataframe with the columns re-arranged
    """
    other = [c for c in df if c not in cols_to_move]
    start = other[0:new_index]
    end = other[new_index:]
    return df[start + cols_to_move + end]
You can use pd.Index.difference with np.hstack, then reindex or use label-based indexing. In general, it's a good idea to avoid list comprehensions or other explicit loops with NumPy / Pandas objects.
import numpy as np

cols_to_move = ['b', 'x']
new_cols = np.hstack((df.columns.difference(cols_to_move), cols_to_move))
# OPTION 1: reindex
df = df.reindex(columns=new_cols)
# OPTION 2: direct label-based indexing
df = df[new_cols]
# OPTION 3: loc label-based indexing
df = df.loc[:, new_cols]
print(df)
# a y b x
# 0 1 -1 2 3
# 1 2 -2 4 6
# 2 3 -3 6 9
# 3 4 -4 8 12
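One caveat worth flagging: Index.difference returns the remaining labels sorted alphabetically, which is harmless for this example but can silently reorder the leading columns in general. If the original order matters, Index.drop preserves it; a minimal sketch:
new_cols = np.hstack((df.columns.drop(cols_to_move), cols_to_move))
df = df.loc[:, new_cols]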
You can use the movecolumn package in Python to move columns:
pip install movecolumn
Then you can write your code as:
import movecolumn as mc
mc.MoveToLast(df,'b')
mc.MoveToLast(df,'x')
Hope that helps.
P.S.: The package can be found here: https://pypi.org/project/movecolumn/
You can also do this as a one-liner:
df.drop(columns=['b', 'x']).assign(b=df['b'], x=df['x'])
This will move any column to the last column of the dataframe:
df = df[[col for col in df.columns if col != 'col_name_to_moved'] + ['col_name_to_moved']]
And this will move any column to the first column of the dataframe:
df = df[['col_name_to_moved'] + [col for col in df.columns if col != 'col_name_to_moved']]
where col_name_to_moved is the column that you want to move.
I use a Pokémon database as an example; the columns of my database are
['Name', '#', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']
Here is the code:
import pandas as pd
df = pd.read_html('https://gist.github.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6')[0]
cols = df.columns.to_list()
cos_end= ["Name", "Total", "HP", "Defense"]
for i, j in enumerate(cos_end, start=(len(cols)-len(cos_end))):
cols.insert(i, cols.pop(cols.index(j)))
print(cols)
df = df.reindex(columns=cols)
print(df)
