Creating an ID for every row based on the observations in a variable - Python

I want to create a system where the observations in a variable map to a number, using Python. The numbers from the (in this case) 5 different variables together form a unique code. The first number corresponds to the first variable. When an observation in a different row is the same as the first, the same number applies. As illustrated in the example, if Apple appears in rows 1 and 3, both IDs get a '1' as the first number.
The output should be a new column with the ID. If all the observations in a row are the same, the IDs will be the same. In the table below you see 5 variables leading to the unique ID on the right, which should be the output.

You can use pd.factorize:
df['UniqueID'] = (df.apply(lambda x: (1 + pd.factorize(x)[0]).astype(str))
                    .agg(''.join, axis=1))
print(df)
# Output
        Fruit     Toy Letter      Car Country UniqueID
0       Apple    Bear      A  Ferrari  Brazil    11111
1  Strawberry  Blocks      B  Peugeot   Chile    22222
2       Apple  Blocks      C  Renault   China    12333
3      Orange    Bear      D     Saab   China    31443
4      Orange    Bear      D  Ferrari   India    31414
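For intuition: pd.factorize encodes a column's values as integers in order of first appearance, so equal observations within a column always receive the same number. A minimal sketch on the Fruit column alone:

import pandas as pd

fruit = pd.Series(['Apple', 'Strawberry', 'Apple', 'Orange', 'Orange'])
codes, uniques = pd.factorize(fruit)
print(codes)    # [0 1 0 2 2] -> adding 1 gives the digits 1, 2, 1, 3, 3
print(uniques)  # Index(['Apple', 'Strawberry', 'Orange'], dtype='object')

The apply call above does this for every column, and agg(''.join, axis=1) then concatenates the per-column digits row-wise into the final ID.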


Matching Strings and Count Frequency

I have a list of companies with their subsidiaries; the data looks as below:
CompanyName                 Employees
Microsoft China                     1
Microsoft India                     1
Microsoft Europe                    1
Apple Inc                           1
Apple Data Inc                      1
Apple Customer Service Inc          1
Data Corp                           1
Data SHCH                           1
Data India                          1
City Corp                           1
Data City                           1
If two companies share a word (e.g. Apple Inc and Apple Data Inc), they are considered one company. I want to group those companies together and calculate their total number of employees.
The expected return should be:
Company    Employees
Microsoft          3
Apple              3
Data               3
City               2
The Company column holds the common word, and the Employees column holds the sum for the company and its subsidiaries.
Most of the pandas functions don't really work in this case. Any suggestions using a for loop?
As you requested in the comments: if the company is always the first word in CompanyName, you can do

# extract the company as the word at index 0
df.CompanyName = df.CompanyName.str.split(expand=True)[0]

# group by company name and count
dfg = df.groupby('CompanyName').agg({'CompanyName': 'count'})

# display(dfg)
             CompanyName
CompanyName
Apple                  3
City                   1
Data                   4
Microsoft              3
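Since every row in the sample has Employees equal to 1, counting rows gives the same result as summing employees. If the two can differ in your real data, summing the Employees column directly (column names assumed from the question) is presumably what you want:

dfg = df.groupby('CompanyName').agg({'Employees': 'sum'})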
I don't think there's a very simple way to do what you want, but it's not too complex either.
First, you need to clearly define the criterion for deciding which names belong to the same company.
We can try "take the first word and see if it matches"; obviously it's not a perfect approach, but it'll do for now.
Then you can create an object to store your new data. I would recommend a dictionary, with entries like company: total employees.
You'll now iterate over the rows of the dataframe, with apply and a function that does what you want. It'll look like this:
counts = {}

def aggregator(row):
    # take the first word of the company name as the key
    word1 = row['CompanyName'].split(" ")[0]
    if word1 in counts:
        counts[word1] += row['Employees']
    else:
        counts[word1] = row['Employees']

df.apply(aggregator, axis=1)
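If you then want the result back as a dataframe, a minimal sketch (counts is the dictionary filled by aggregator above):

result = pd.DataFrame(list(counts.items()), columns=['Company', 'Employees'])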

How can I iterate through all rows of a dataframe, apply a lookup function to a string value, and write the result to a new column?

I have a dataframe with several columns of personal data for each row (person). I want to apply a function that looks up each person's city or state in regional lists, and then writes the result to a new column "Region" in the same dataframe.
I have been able to make the same operation work with a very simplified dataframe with categories for colors and vehicles (see below). But when I try to do it with the personal data, it won't work the same way and I don't understand why.
I've read through many threads on lambda functions, but I think what I'm asking is too complex for that. Most solutions deal with numerical data and I'm using strings, but as I said, I was able to make it work with one dataset. Obviously I'm new here. I'd also appreciate advice on how to build the new column as part of the function instead of having to build it as a separate step, but that isn't frustrating me as much as the main question.
This example works:
# Python: import pandas
import pandas as pd

# Simple dataframe with an empty column 'type'
df = pd.DataFrame({'one': ['1','2','3','4','5','6','7','8'],
                   'two': ['A','B','C','D','E','F','G','H'],
                   'three': ['car','bus','red','blue','truck','pencil','yellow','green'],
                   'type': ''})
df displays:
  one two   three type
0   1   A     car
1   2   B     bus
2   3   C     red
3   4   D    blue
4   5   E   truck
5   6   F  pencil
6   7   G  yellow
7   8   H   green
Now define lists and custom function:
# Define lists of colors and vehicles
colors = ['red','blue','green','yellow']
vehicles = ['car','truck','bus','motorcycle']

# Function 'celltype' returns a category based on x
def celltype(x):
    if x in colors: return 'color'
    elif x in vehicles: return 'vehicle'
    else: return 'other'
Then construct a loop to iterate through each row and apply the function:
# Loop to iterate through df rows and apply 'celltype' to column 'three' in each row
for index, row in df.iterrows():
    row['type'] = celltype(row['three'])
And in this case the result is just what I want:
  one two   three     type
0   1   A     car  vehicle
1   2   B     bus  vehicle
2   3   C     red    color
3   4   D    blue    color
4   5   E   truck  vehicle
5   6   F  pencil    other
6   7   G  yellow    color
7   8   H   green    color
This example doesn't work, and I don't know why:
df1 = pd.DataFrame({'Last Name': ['SMITH','JONES','WILSON','DOYLE','ANDERSON'],
                    'First Name': ['TOM','DICK','HARRY','MICHAEL','KEVIN'],
                    'Code': [12,34,56,78,90],
                    'Deparment': ['Research','Management','Maintenance','Marketing','IT'],
                    'City': ['NEW YORK','BOSTON','SAN FRANCISCO','DALLAS','DETROIT'],
                    'State': ['NY','MA','CA','TX','MI'],
                    'Region': ''})
df1 displays:
  Last Name First Name  Code    Deparment           City State Region
0     SMITH        TOM    12     Research       NEW YORK    NY
1     JONES       DICK    34   Management         BOSTON    MA
2    WILSON      HARRY    56  Maintenance  SAN FRANCISCO    CA
3     DOYLE    MICHAEL    78    Marketing         DALLAS    TX
4  ANDERSON      KEVIN    90           IT        DETROIT    MI
Again, defining lists and functions:
# Define lists for regions
east = ['NEW YORK','BOSTON']
west = ['SAN FRANCISCO','LOS ANGELES']
south = ['TX']

# Function 'region' returns a region based on x
def region(x):
    if x in east: return 'east'
    elif x in west: return 'west'
    elif x in south: return 'south'
    else: return 'other'

# Loop to iterate through df1 rows and apply 'region' to column 'City' in each row
for index, row in df1.iterrows():
    row['Region'] = region(row['City'])
    if row['Region'] == 'other': row['Region'] = region(row['State'])
This results in an unchanged df1: the 'Region' column is still blank. We should see "east", "east", "west", "south", "other". The only difference in the code is the additional 'if' statement to catch Dallas by state (which is something I need for my real-world dataset), but I think that line is sound, and I get the same result without it.
First off, apply and iterrows are slow, so try not to use them, ever.
What I usually do in this situation is to create a pair of forward and backward dicts:
forward = {'east': east,
           'west': west,
           'south': south}
backward = {x: k for k, v in forward.items() for x in v}
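# with the lists above, the inverted mapping looks like this:
# backward == {'NEW YORK': 'east', 'BOSTON': 'east',
#              'SAN FRANCISCO': 'west', 'LOS ANGELES': 'west',
#              'TX': 'south'}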
And then update with map. Since you want to update based on two columns, fillna will be helpful:
df1['Region'] = (df1['State'].map(backward)
                             .fillna(df1['City'].map(backward))
                             .fillna('other'))
gives:
  Last Name First Name  Code    Deparment           City State Region
0     SMITH        TOM    12     Research       NEW YORK    NY   east
1     JONES       DICK    34   Management         BOSTON    MA   east
2    WILSON      HARRY    56  Maintenance  SAN FRANCISCO    CA   west
3     DOYLE    MICHAEL    78    Marketing         DALLAS    TX  south
4  ANDERSON      KEVIN    90           IT        DETROIT    MI  other
Your issue is with using iterrows. In general, you should never modify something you are iterating over. In this case, iterrows is creating a copy of your data, so you are not actually modifying df1. Whether you get a copy or not can depend on the circumstances (for example, on the column dtypes), which is exactly why this pattern is best avoided.
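A minimal sketch of that failure mode (toy column names, purely for illustration):

import pandas as pd

df = pd.DataFrame({'code': [12, 34], 'region': ['', '']})  # mixed dtypes
for index, row in df.iterrows():
    row['region'] = 'east'        # modifies the temporary Series, not df
print(df['region'].tolist())      # ['', ''] - the dataframe is unchanged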
You can make sure it modifies the original by calling the Dataframe directly with at:
for index, row in df1.iterrows():
    df1.at[index, 'Region'] = region(row['City'])
    if df1.at[index, 'Region'] == 'other':
        df1.at[index, 'Region'] = region(row['State'])

How to deal with a copy-pasted table in pandas - reshaping a column vector

I have a table I copied from a webpage which, when pasted into LibreOffice Calc or Excel, occupies a single cell, and when pasted into Notepad becomes a 3507x1 column. If I import this as a pandas dataframe using pd.read_csv I see the same 3507x1 column, and I'd now like to reshape it into the 501x7 array that it started as.
I thought I could recast it as a numpy array, reshape it as I am familiar with in numpy, and then put it back into a df, but the to_numpy methods of pandas seem to want to work with a Series object (not a DataFrame), and attempts to read the file into a Series using e.g.
ser = pd.Series.from_csv('billionaires')
led to tokenizing errors. Is there some simple way to do this? Or maybe I should throw in the towel on this direction and read from the HTML?
A simple copy-paste does not give you any clear column separator, so there's no easy way to do it.
You have only spaces, but spaces may also appear inside the column values (like in the name or the country), so it's impossible to give DataFrame.read_csv a column separator.
However, if I copy-paste the table into a file, I notice some regularity.
If you know regex, you can try using pandas.Series.str.extract. This method extracts capture groups in a regex pattern as columns of a DataFrame. The regex is applied to each element / string of the series.
You can then try to find a regex pattern that captures the various elements of the row, to split them into separate columns.
df = pd.read_csv('data.txt', names=["A"])  # no header in the file
ss = df['A']
rdf = ss.str.extract(r'(\d)\s+(.+)(\$[\d\.]+B)\s+([+-]\$[\d\.]+[BM])\s+([+-]\$[\d\.]+B)\s+([\w\s]+)\s+([\w\s]+)')
Here I tried to write a regex for the table in the link; the result on the first rows seems pretty good. (Note that (\d) captures only a single digit, so a two-digit rank like 10 shows up as just its last digit.)
   0  1                             2       3        4        5              6
0  1  Jeff Bezos                    $121B   +$231M   -$3.94B  United States  Technology
1  3  Bernard Arnault               $104B   +$127M   +$35.7B  France         Consumer
2  4  Warren Buffett                $84.9B  +$66.3M  +$1.11B  United States  Diversified
3  5  Mark Zuckerberg               $76.7B  -$301M   +$24.6B  United States  Technology
4  6  Amancio Ortega                $66.5B  +$303M   +$7.85B  Spain          Retail
5  7  Larry Ellison                 $62.3B  +$358M   +$13.0B  United States  Technology
6  8  Carlos Slim                   $57.0B  -$331M   +$2.20B  Mexico         Diversified
7  9  Francoise Bettencourt Meyers  $56.7B  -$1.12B  +$10.5B  France         Consumer
8  0  Larry Page                    $55.7B  +$393M   +$4.47B  United States  Technology
I used DataFrame.read_csv to read the file, since Series.from_csv is deprecated.
I found that converting to a numpy array was far easier than I had realized: the numpy asarray method can handle a df (and, conveniently enough, it works for general objects, not just numbers).
import numpy as np
import pandas as pd

df = pd.read_csv('billionaires', sep='\n')
print(df.shape)
# -> (3507, 1)
n = np.asarray(df)
m = np.reshape(n, [-1, 7])
df2 = pd.DataFrame(m)
df2.head()
   0  1                2                3              4             \
0  0  Name             Total net worth  $ Last change  $ YTD change
1  1  Jeff Bezos       $121B            +$231M         -$3.94B
2  2  Bill Gates       $107B            -$421M         +$16.7B
3  3  Bernard Arnault  $104B            +$127M         +$35.7B
4  4  Warren Buffett   $84.9B           +$66.3M        +$1.11B

   5              6
0  Country        Industry
1  United States  Technology
2  United States  Technology
3  France         Consumer
4  United States  Diversified
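Since the web page's header row survives as row 0 of the reshaped frame, one extra step promotes it to real column names; a minimal sketch, assuming the layout shown above:

df2.columns = df2.iloc[0]            # row 0 holds the original header
df2 = df2.drop(0).reset_index(drop=True)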

Python - function similar to VLOOKUP (Excel)

I am trying to join two dataframes but cannot get my head around the possibilities Python has to offer.
First dataframe:
ID  MODEL    REQUESTS  ORDERS
1   Golf     123       4
2   Passat   34        5
3   Model 3  500       8
4   M3       5         0
Second dataframe:
MODEL    TYPE   MAKE
Golf     Sedan  Volkswagen
M3       Coupe  BMW
Model 3  Sedan  Tesla
What I want is to add another column to the first dataframe, called "MAKE", so that it looks like this:
ID  MODEL    MAKE        REQUESTS  ORDERS
1   Golf     Volkswagen  123       4
2   Passat   Volkswagen  34        5
3   Model 3  Tesla       500       8
4   M3       BMW         5         0
I already looked at merge, join and map but all examples just appended the required information at the end of the dataframe.
I think you can use insert together with map on a Series created from df2 (if some MODEL value is missing from df2, you get NaN):
df1.insert(2, 'MAKE', df1['MODEL'].map(df2.set_index('MODEL')['MAKE']))
print(df1)

   ID    MODEL        MAKE  REQUESTS  ORDERS
0   1     Golf  Volkswagen       123       4
1   2   Passat         NaN        34       5
2   3  Model 3       Tesla       500       8
3   4       M3         BMW         5       0
Although it's not needed in this case, there might be scenarios where df2 has more than two columns and you would just want to add one of those to df1, based on a specific column as the key. Here is a generic snippet that you may find useful:
df = pd.merge(df1, df2[['MODEL', 'MAKE']], on='MODEL', how='left')
The join method acts very similarly to a VLOOKUP. It joins a column in the first dataframe to the index of the second dataframe, so you must set MODEL as the index of the second dataframe and grab only the MAKE column:
df1.join(df2.set_index('MODEL')['MAKE'], on='MODEL')
Take a look at the documentation for join as it actually uses the word VLOOKUP.
I always found merge to be an easy way to do this:
df1.merge(df2[['MODEL', 'MAKE']], how='left')
However, I must admit it would not be as short and nice if you wanted to call the new column something other than 'MAKE'.
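In that case, a rename on the right-hand frame before merging keeps it reasonably compact (the new column name MANUFACTURER is just an example):

df1.merge(df2[['MODEL', 'MAKE']].rename(columns={'MAKE': 'MANUFACTURER'}),
          on='MODEL', how='left')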

Filling in a pandas column based on existing number of strings

I have a pandas dataframe that looks like this:
ID  Hobby     Name
1   Travel    Kevin
2   Photo     Andrew
3   Travel    Kevin
4   Cars      NaN
5   Photo     Andrew
6   Football  NaN
... (1303 rows in total)
The number of distinct names might be larger than 2 as well. I would like to end up with the entire Name column filled, with the rows split equally between the names (or +1 where the rows don't divide evenly). I already store the total number of distinct names in a variable; in the case above it's 2. I tried filtering and counting by each name, but I don't know how to make this work when the number of names is dynamic.
Expected Dataframe:
ID  Hobby     Name
1   Travel    Kevin
2   Photo     Andrew
3   Travel    Kevin
4   Cars      Kevin
5   Photo     Andrew
6   Football  Andrew
I tried: replace NaN with 0 in column Name using fillna, filter the column to end up with a dataframe that has only the NaN fields, then use len(df) to get the number of NaNs, and from there create 2 dataframes each containing half of the df. But I think this approach is completely wrong, as I do not always have 2 names; there could be 2, 3, 4, etc. (this is given by a dictionary).
Any help highly appreciated. Thanks.
It's difficult to tell, but I think you need ffill:
df['Name'] = df['Name'].ffill()
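On the sample above, forward-filling propagates the last seen name downwards, which happens to reproduce the expected output here (Cars inherits Kevin from the row above it, Football inherits Andrew):

import pandas as pd

df = pd.DataFrame({'Hobby': ['Travel', 'Photo', 'Travel', 'Cars', 'Photo', 'Football'],
                   'Name': ['Kevin', 'Andrew', 'Kevin', None, 'Andrew', None]})
df['Name'] = df['Name'].ffill()
print(df['Name'].tolist())
# ['Kevin', 'Andrew', 'Kevin', 'Kevin', 'Andrew', 'Andrew']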
