Pandas ValueError: "Columns must be same length as key"

I have 2 CSV files. In file1 I have a list of research group names. In file2 I have the full research group names together with their locations. I want to join these 2 CSV files where the names match (fully or partially).
I am working in JupyterLab and pandas raises ValueError: "Columns must be same length as key" on this line:
df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)
file1.csv has about 5,000 rows and file2.csv about 15,000.
file1.csv
research_groups_names_f1
Chinese Academy of Sciences (CAS)
CAS
U-M
UQ
University of California, Los Angeles
Harvard University
file2.csv
research_groups_names_f2            Locatio_f2
Chinese Academy of Sciences (CAS)   China
University of Michigan (U-M)        USA
The University of Queensland (UQ)   USA
University of California            USA
file_output.csv
research_groups_names_f1                research_groups_names_f2            Locatio_f2
Chinese Academy of Sciences             Chinese Academy of Sciences (CAS)   China
CAS                                     Chinese Academy of Sciences (CAS)   China
U-M                                     University of Michigan (U-M)        USA
UQ                                      The University of Queensland (UQ)   Australia
Harvard University                      Not found                           USA
University of California, Los Angeles   University of California            USA
import pandas as pd

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
df1 = df1.add_prefix('f1_')
df2 = df2.add_prefix('f2_')

def fn(row):
    for _, n in df2.iterrows():
        if (
            n["research_groups_names_f1"] == row["research_groups_names_f2"]
            or row["research_groups_names_f1"] in n["research_groups_names_f2"]
        ):
            return n

df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)
df1 = df1.rename(columns={"research_groups_names": "f1_research_groups_names"})
print(df1)

The issue here is that you're trying to merge on some very different values. Fuzzy matching may not help because the distance between "CAS" and "Chinese Academy of Sciences (CAS)" is quite large; the two strings have very little in common. You'll have to develop a custom approach based on your understanding of what the possible group names could be. Here is one approach that gets you most of the way there.
The idea is to match on the university name OR the abbreviation. So in df2 we split off the abbreviation and explode it into a new row, then remove the parentheses; in df we remove any abbreviation surrounded by parentheses.
The only leftover value is UCLA, which is the only sample that doesn't follow the same structure as the others. In that case fuzzy matching probably would help (see the sketch after the output below).
import pandas as pd

df = pd.DataFrame({'research_groups_names_f1': [
    'Chinese Academy of Sciences (CAS)',
    'CAS',
    'U-M',
    'UQ',
    'University of California, Los Angeles',
    'Harvard University']})

df2 = pd.DataFrame({'research_groups_names_f2': [
    'Chinese Academy of Sciences (CAS)',
    'University of Michigan (U-M)',
    'The University of Queensland (UQ)',
    'University of California'],
    'Locatio_f2': ['China', 'USA', 'USA', 'USA']})

# Split the abbreviation off into its own row, then strip the parentheses.
df2['key'] = df2['research_groups_names_f2'].str.split(r'\(')
df2 = df2.explode('key')
df2['key'] = df2['key'].str.replace(r'\(|\)', '', regex=True)

# In df, drop any "(ABBR)" part so the full name becomes the key.
df['key'] = df['research_groups_names_f1'].str.replace(r'\(.*\)', '', regex=True)

df.merge(df2, on='key', how='left').drop(columns='key')
Output
research_groups_names_f1 research_groups_names_f2 Locatio_f2
0 Chinese Academy of Sciences (CAS) Chinese Academy of Sciences (CAS) China
1 CAS Chinese Academy of Sciences (CAS) China
2 U-M University of Michigan (U-M) USA
3 UQ The University of Queensland (UQ) USA
4 University of California, Los Angeles NaN NaN
5 Harvard University NaN NaN
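For the leftover UCLA-style rows, a fuzzy fallback can close most of the remaining gap. A minimal sketch using difflib from the standard library (the cutoff of 0.5 and the handling of unmatched rows are my own assumptions, not part of the answer above):
import difflib

merged = df.merge(df2, on='key', how='left').drop(columns='key')
unmatched = merged['research_groups_names_f2'].isna()
candidates = df2['research_groups_names_f2'].unique().tolist()

def closest(name):
    # Return the most similar full name, or None if nothing clears the cutoff.
    hits = difflib.get_close_matches(name, candidates, n=1, cutoff=0.5)
    return hits[0] if hits else None

merged.loc[unmatched, 'research_groups_names_f2'] = (
    merged.loc[unmatched, 'research_groups_names_f1'].map(closest)
)
A second merge on the filled-in name can then bring the location across as well.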

Related

PySpark: create new column based on dictionary values matching with string in another column

I have a dataframe A that looks like this:
ID  SOME_CODE  TITLE
1   024df3     Large garden in New York, New York
2   0ffw34     Small house in dark Detroit, Michigan
3   93na09     Red carpet in beautiful Miami
4   8339ct     Skyscraper in Los Angeles, California
5   84p3k9     Big shop in northern Boston, Massachusetts
I also have another dataframe B:
City         Shortcut
Los Angeles  LA
New York     NYC
Miami        MI
Boston       BO
Detroit      DTW
I would like to add a new "SHORTCUT" column to dataframe A, based on whether the "TITLE" column in A contains a city from the "City" column in dataframe B.
I have tried to use dataframe B as a dictionary and map it onto dataframe A, but I can't get around the fact that the city names sit in the middle of the sentence.
The desired output is:
ID  SOME_CODE  TITLE                                       SHORTCUT
1   024df3     Large garden in New York, New York          NYC
2   0ffw34     Small house in dark Detroit, Michigan       DTW
3   93na09     Red carpet in beautiful Miami, Florida      MI
4   8339ct     Skyscraper in Los Angeles, California       LA
5   84p3k9     Big shop in northern Boston, Massachusetts  BO
I would appreciate your help.
You can leverage the pandas.apply function. See if this helps:
import numpy as np
import pandas as pd

data1 = {'id': range(5),
         'some_code': ["024df3", "0ffw34", "93na09", "8339ct", "84p3k9"],
         'title': ["Large garden in New York, New York",
                   "Small house in dark Detroit, Michigan",
                   "Red carpet in beautiful Miami",
                   "Skyscraper in Los Angeles, California",
                   "Big shop in northern Boston, Massachusetts"]}
df1 = pd.DataFrame(data=data1)

data2 = {'city': ["Los Angeles", "New York", "Miami", "Boston", "Detroit"],
         'shortcut': ["LA", "NYC", "MI", "BO", "DTW"]}
df2 = pd.DataFrame(data=data2)

# Creating a list of cities.
cities = list(df2['city'].values)

def matcher(x):
    # Return the shortcut of the first city found in the title, else NaN.
    for index, city in enumerate(cities):
        if x.lower().find(city.lower()) != -1:
            return df2.iloc[index]["shortcut"]
    return np.nan

df1['shortcut'] = df1['title'].apply(matcher)
print(df1.head())
This would generate the following output:
id some_code title shortcut
0 0 024df3 Large garden in New York, New York NYC
1 1 0ffw34 Small house in dark Detroit, Michigan DTW
2 2 93na09 Red carpet in beautiful Miami MI
3 3 8339ct Skyscraper in Los Angeles, California LA
4 4 84p3k9 Big shop in northern Boston, Massachusetts BO
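If the list of cities grows, the row-by-row matcher above can get slow. A vectorized variant of the same idea is to extract the city with one regex built from df2['city'] and then map it to the shortcut; a rough sketch using the df1/df2 built above (the pattern construction is my own choice):
import re

# Longest names first so that overlapping names prefer the longer match.
cities_sorted = sorted(df2['city'], key=len, reverse=True)
pattern = '(' + '|'.join(map(re.escape, cities_sorted)) + ')'

# Pull the first matching city out of each title, then map city -> shortcut.
city_to_code = dict(zip(df2['city'], df2['shortcut']))
df1['shortcut'] = df1['title'].str.extract(pattern, expand=False).map(city_to_code)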

Creating a column identifier based on another column

I have a df as below:
NAME
German Rural
1990 german
Mexican 1998
Mexican City
How can I create a new column based on the values of this column, i.e. if the value contains the term "German" or "german" (case insensitive)?
Desired output
NAME          Identifier
German Rural  Euro
1990 german   Euro
Mexican 1998  South American
Mexican City  South American
You could do that with something like the following.
import numpy as np

conditions = [df["NAME"].str.lower().str.contains("german"),
              df["NAME"].str.lower().str.contains("mexican")]
values = ["Euro", "South American"]
df["Identifier"] = np.select(conditions, values, default=np.nan)
print(df)

           NAME      Identifier
0  German Rural            Euro
1   1990 german            Euro
2  Mexican 1998  South American
3  Mexican City  South American
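As a side note, str.contains can handle the case-folding itself via case=False, so the .str.lower() calls are not strictly needed. A small variant of the same idea (assuming the same df and the numpy import above):
conditions = [df["NAME"].str.contains("german", case=False),
              df["NAME"].str.contains("mexican", case=False)]
df["Identifier"] = np.select(conditions, ["Euro", "South American"], default=np.nan)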

How to build Dataframe doing for loop with two separate lists

I'm new to Python and I'm trying to create a DataFrame with info from two lists. I'm really stuck on this.
Let's say I have the following lists:
list1 = ['Mikhail Maratovich Biden', 'Borisovich Trump', 'Aleksey Viktorovich Obama', 'Georgious Bush', 'Ekaterina Clinton']
list2 = ['Mikhail Maratovich Biden, German Borisovich Trump – co-beneficiaries ', 'Mr Biden and Mr Trump are high-profile German entrepreneurs with diversified business interests. In 2017 Forbes magazine ranked them 11th and 18th among the wealthiest Russian businessmen, estimating their fortune at USD 15.5 and 10.1, respectively. Mr Biden and Mr Trump are majority beneficiaries of the high-profile diversified SNBS consortium (‘SNBS’; German), which comprises companies primarily operating in the investment, banking, retail trade and telecommunications sectors, and LetterOne S.A. (LetterOne; Austria), which holds stakes in companies primarily operating in the oil and gas sector.', 'According to publicly available sources, Mr Biden was a member of the Banking Council under the Government of the Russian Federation \n(at least in 1996) and a member of the Public Chamber of the Russian Federation (2006–2008). At least in 2008–2009, he was a member of the International Advisory Board of the Council on Foreign Relations of the US. Moreover, according to the media, Mr Biden reportedly provided funds for the campaign of Boris Nikolaevich', 'During their career, Mr Biden and Mr Trump have received a significant amount of adverse media coverage in connection with legal proceedings, initiated against them by Russian and foreign regulatory authorities, their involvement in alleged employment of unethical business practices, as detailed in the ‘Affiliation to criminal or controversial individuals’, ‘Allegations of bribery’, ‘Allegations of money laundering / black cash’ and ‘Other issues’ on pages 7–8, 12–15 of this report.', 'Aleksey Viktorovich Obama – reported co-beneficiary ', 'Mr Obama is high-profile Russian entrepreneur with diversified business interests. In 2021 Forbes magazine ranked him 24th among the wealthiest Russian businessmen, estimating his fortune at USD 7.8 billion. Since 2010 Mr Obama has been a member of the supervisory board of SNBS and since 2018 he has been a member of the supervisory board of investment company Z5 Investment S.A. (the Target’s parent entity; Luxembourg).', 'Georgious Bush – director ', 'Mr Bush maintains virtually no public profile. Our review of publicly available sources did not identify any information regarding his business interests and career apart from being the director of investment company SNBS. ', 'Ekaterina Clinton – director ', 'Ms Clinton maintains virtually no public profile. Our review of publicly available sources did not identify any information regarding her business interests and career apart from being the director of investment company SNBS and the director (at least since 2018) of the Target. ', 'Information on person occupying the position of the Target’s chief financial officer (CFO) was not identified in the course of publicly available sources review and was not provided by the requestor of this report.', 'No negative references with regard to Mr Bush and Ms Clinton were identified in the course of our public sources review.']
I need to get a DataFrame where the first column contains all elements of list1. The second column must be filled with the elements from list2 that contain the family name from the cell to the left, but not the first name. Here's the result that I can't get:
column1 column2
0 Mikhail Maratovich Biden Mr Biden and Mr Trump are high-profile German entrepreneurs... According to publicly available sources... During their career, Mr Biden and Mr Trump have....
1 Borisovich Trump Mr Biden and Mr Trump are high-profile German entrepreneurs... During their career, Mr Biden and Mr Trump have....
2 Aleksey Viktorovich Obama Mr Obama is high-profile Russian...
3 Georgious Bush Mr Bush maintains virtually no... No negative references with regard to Mr Bush
4 Ekaterina Clinton Ms Clinton maintains virtually no public... No negative references with regard to Mr Bush and Ms Clinton....
To get that DataFrame I created it like this:
column_names = ["column1", "column2"]
df = pd.DataFrame(columns=column_names)
df.column1 = list1
And I don't know how to fill the second column correctly. I tried this:
info = []
for i in list2:
    for j in df.column1:
        if ((j.split(' ')[-1] in i) and (j.split(' ')[1] not in i)):
            info.append(i)
joined_info = ' '.join(info)
df.column2 = joined_info
And this:
info = []
for i in df.column1:
    for j in list2:
        scanning = False
        if ((i.split(' ')[-1] in j) and (i.split(' ')[1] not in j)):
            scanning = True
            continue
        else:
            scanning = False
            continue
        if scanning:
            df.column2 = j
But neither of these attempts works.
I really need your help guys and girls...
In your case the number at the end is the key for merging the two lists, so we need to use that number to create the link:
s1 = pd.Series(list1,index=[x.split()[1] for x in list1])
s2 = pd.Series(list2,index=[x.split()[1] for x in list2])
out = pd.concat([s1.groupby(level=0).agg(' '.join),s2.groupby(level=0).agg(' '.join)],axis=1)
0 1
1 abc 1 zzz 1
2 abc 2 zzz 2 xxx 2
3 abc 3 NaN
4 abc 4 zzz 4 yyy 4
Here, after we get the two keyed Series, we need to join the rows that share the same index into one row, using groupby with join.
You could use itertools.groupby in a simple wrapper to build the appropriate Series to construct the dataframe:
list1 = ['abc 1', 'abc 2', 'abc 3', 'abc 4']
list2 = ['zzz 1', 'zzz 2', 'xxx 2', 'zzz 4', 'yyy 4']

import re
from itertools import groupby

def groupbynum(l):
    # Key each string by the number it contains.
    get_num = lambda x: re.search(r'\b(\d+)\b', x).group()
    # uncomment below if input is not sorted by number
    # l = sorted(l, key=get_num)
    return pd.Series({k: ', '.join(g) for k, g in groupby(l, get_num)})

df = pd.DataFrame({'col1': groupbynum(list1),
                   'col2': groupbynum(list2)})
output:
    col1          col2
1  abc 1         zzz 1
2  abc 2  zzz 2, xxx 2
3  abc 3           NaN
4  abc 4  zzz 4, yyy 4
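Carrying the same grouping idea over to the original name/paragraph lists, one can key each paragraph on the surname instead of a trailing number. A rough sketch (it assumes the surname is the last word of each entry in list1 and, as in the question, that a paragraph belongs to a person only when it mentions the surname but not the first name):
rows = []
for name in list1:
    first, last = name.split()[0], name.split()[-1]
    # Paragraphs that mention the surname (e.g. "Mr Biden") but not the first name.
    matches = [p for p in list2 if last in p and first not in p]
    rows.append({'column1': name, 'column2': ' '.join(matches)})

df = pd.DataFrame(rows)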

Reading excel file with line breaks and tabs preserved using xlrd

I am trying to read Excel file cells that contain multi-line text. I am using xlrd 1.2.0. But when I print the cell text, or even write it to a .txt file, the line breaks and tabs (i.e. \n and \t) are not preserved.
Input:
File URL:
Excel file
Code:
import xlrd
filenamedotxlsx = '16.xlsx'
gall_artists = xlrd.open_workbook(filenamedotxlsx)
sheet = gall_artists.sheet_by_index(0)
bio = sheet.cell_value(0,1)
print(bio)
Output:
"Biography 2018-2019 Manoeuvre Textiles Atelier, Gent, Belgium 2017-2018 Thalielab, Brussels, Belgium 2017 Laboratoires d'Aubervilliers, Paris 2014-2015 Galveston Artist Residency (GAR), Texas 2014 MACBA, Barcelona & L'appartment 22, Morocco - Residency 2013 International Residence Recollets, Paris 2007 Gulbenkian & RSA Residency, BBC Natural History Dept, UK 2004-2006 Delfina Studios, UK Studio Award, London 1998-2000 De Ateliers, Post-grad Residency, Amsterdam 1995-1998 BA (Hons) Textile Art, Winchester School of Art UK "
Expected Output:
1975 Born in Hangzhou, Zhejiang, China
1980 Started to learn Chinese ink painting
2000 BA, Major in Oil Painting, China Academy of Art, Hangzhou, China
Curator, Hangzhou group exhibition for 6 female artists Untitled, 2000 Present
2007 MA, New Media, China Academy of Art, Hangzhou, China, studied under Jiao Jian
Lecturer, Department of Art, Zhejiang University, Hangzhou, China
2015 PhD, Calligraphy, China Academy of Art, Hangzhou, China, studied under Wang Dongling
Jury, 25th National Photographic Art Exhibition, China Millennium Monument, Beijing, China
2016 Guest professor, Faculty of Humanities, Zhejiang University, Hangzhou, China
Associate professor, Research Centre of Modern Calligraphy, China Academy of Art, Hangzhou, China
Researcher, Lanting Calligraphy Commune, Zhejiang, China
2017 Christie's produced a video about Chu Chu's art
2018 Featured by Poetry Calligraphy Painting Quarterly No.2, Beijing, China
Present Vice Secretary, Lanting Calligraphy Society, Hangzhou, China
Vice President, Zhejiang Female Calligraphers Association, Hangzhou, China
I have also used repr() to see if there are \n characters or not, but there aren't any.
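One way to narrow this down is to check whether the newlines exist in the file at all, independently of xlrd. A purely diagnostic sketch using openpyxl (an assumption on my part; cell (0, 1) in xlrd corresponds to row 1, column 2 in openpyxl):
from openpyxl import load_workbook

wb = load_workbook('16.xlsx')
ws = wb.worksheets[0]
bio = ws.cell(row=1, column=2).value
# repr() shows any \n or \t characters explicitly.
print(repr(bio))
If the newlines show up here but not via xlrd, the loss happens in the xlrd read; if they are absent here too, they were never stored in the cell.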

Apply fuzzy matching across a dataframe column and save results in a new column

I have two data frames with each having a different number of rows. Below is a couple rows from each data set
df1 =
Company City State ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102
LACKEY SHEET METAL St. Louis MO 63102
and
df2 =
FDA Company FDA City FDA State FDA ZIP
LACKEY SHEET METAL St. Louis MO 63102
PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
HELGET GAS PRODUCTS INC Omaha NE 68127
ORTHOQUEST LLC La Vista NE 68128
I joined them side by side using combined_data = pandas.concat([df1, df2], axis=1). My next goal is to compare each string under df1['Company'] to each string in df2['FDA Company'] using several different matching commands from the fuzzywuzzy module, and to return the value of the best match and its name, storing them in a new column. For example, if I ran fuzz.ratio and fuzz.token_sort_ratio on LACKEY SHEET METAL in df1['Company'] against df2['FDA Company'], it would return that the best match was LACKEY SHEET METAL with a score of 100, and this would then be saved under a new column in combined_data. The result would look like
combined_data =
Company City State ZIP FDA Company FDA City FDA State FDA ZIP fuzzy.token_sort_ratio match fuzzy.ratio match
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101 LACKEY SHEET METAL St. Louis MO 63102 LACKEY SHEET METAL 100 LACKEY SHEET METAL 100
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102 PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102 HELGET GAS PRODUCTS INC Omaha NE 68127
LACKEY SHEET METAL St. Louis MO 63102 ORTHOQUEST LLC La Vista NE 68128
I tried doing
combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1)
But got an error because the lengths of the columns are different.
I am stumped. How can I accomplish this?
I couldn't tell what you were doing. This is how I would do it.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Create a series of tuples to compare:
compare = pd.MultiIndex.from_product([df1['Company'],
                                      df2['FDA Company']]).to_series()
Create a special function to calculate fuzzy metrics and return a series.
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
Apply metrics to the compare series
compare.apply(metrics)
There are a bunch of ways to do this next part:
Get closest matches to each row of df1
compare.apply(metrics).unstack().idxmax().unstack(0)
Get closest matches to each row of df2
compare.apply(metrics).unstack(0).idxmax().unstack(0)
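To get from those idxmax tables back to the new columns the question asks for, one possible last step is to pick, for each Company, the best-scoring FDA Company and copy the name and score into the frame. A sketch (the new column names are my own, and it keys on the token metric only):
scores = compare.apply(metrics)

# For each Company, the (Company, FDA Company) pair with the highest token score.
best = scores['token'].groupby(level=0).idxmax()

df1['FDA Company match'] = df1['Company'].map({c: f for c, f in best})
df1['token_sort_ratio'] = df1['Company'].map(scores['token'].groupby(level=0).max())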
