Good evening.
I would like help with how to get information between two strings in pandas (Python).
Imagine that I have a database of car prices for each car dealer, in which each cell has text similar to this (note: the car dealer column can be the index of each row):
"1 - Ford (1) - new:R$60000 / used:R$30000 sedan car 2 - Mercedes -
Benz (1) - new:R$130000 / used:R$95000 silver sedan car 3 - Chery
(caoa) (1) - new:R$80000 / used:R$60000 SUV car 5 - Others (1) -
new:R$90000 / used:R$75500 hatch car"
Thanks for the help!!
Three-step process:
1. split the whole string into line items using re.split()
2. parse out the constituent parts of each line using pandas extract
3. finally, reshape the dataframe to wide format
import re
import pandas as pd
import numpy as np
s = "1 - Ford (1) - new:R$60000 / used:R$30000 sedan car 2 - Mercedes - Benz (1) - new:R$130000 / used:R$95000 silver sedan car 3 - Chery (caoa) (1) - new:R$80000 / used:R$60000 SUV car 5 - Others (1) - new:R$90000 / used:R$75500 hatch car"
# there is a car after each "<n> - "; split into line items
df = pd.DataFrame(re.split("[ ]?[0-9] - ", s)).replace("", np.nan).dropna()
# parse out each of the strings (raw string, so the regex escapes are not read as string escapes)
df = df[0].str.extract(r"(?P<car>.*) \([0-9]\) - new:R\$(?P<new>[0-9]*) / used:R\$(?P<used>[0-9]*).*")
# finally format as wide format...
df = (df.melt().assign(car=lambda dfa: dfa.groupby("variable").cumcount(),
col=lambda dfa: dfa.variable + (dfa.car+1).astype(str))
.drop(columns=["variable","car"])
.set_index("col")
.T
)
       car1  car2             car3          car4    new1   new2    new3   new4   used1  used2  used3  used4
value  Ford  Mercedes - Benz  Chery (caoa)  Others  60000  130000  80000  90000  30000  95000  60000  75500
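To see what the first two steps produce before any reshaping, here is a minimal sketch using only the standard library. It assumes every item starts with "<number> - " and follows the "name (n) - new:R$... / used:R$..." layout shown in the question; the names items, line_re and records are made up for illustration.

```python
import re

s = ("1 - Ford (1) - new:R$60000 / used:R$30000 sedan car "
     "2 - Mercedes - Benz (1) - new:R$130000 / used:R$95000 silver sedan car")

# Step 1: split on "<number> - " and drop the empty piece before the first item.
items = [part for part in re.split(r"\s?\d+ - ", s) if part]

# Step 2: pull the named parts out of each item.
line_re = re.compile(r"(?P<car>.*) \(\d+\) - new:R\$(?P<new>\d+) / used:R\$(?P<used>\d+)")
records = [line_re.match(item).groupdict() for item in items]

for record in records:
    print(record)
```

Using \d+ rather than [0-9] in the split pattern also lets the sketch cope with item numbers above 9, which the sample data does not contain.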
You could use extractall to get a MultiIndex dataframe that contains the dealer (the original row), the match number (the car) and the values captured by the regex named groups. After extractall, use stack to move the group names into the innermost index level; this lets you build a new index of the form [(dealer, carN), ...] and then group by the first index level, which keeps the capture order. Append each dealer's data to a list and create the dataframe from that list.
import pandas as pd
import re
df = pd.DataFrame(
["1 - Ford (1) - new:R$60000 / used:R$30000 sedan car 2 - Mercedes - Benz (1) - new:R$130000 / used:R$95000 silver sedan car 3 - Chery (caoa) (1) - new:R$80000 / used:R$60000 SUV car 5 - Others (1) - new:R$90000 / used:R$75500 hatch car",
"2 - Toyota (1) - new:R$10543 / used:R$9020 silver sedan car",
"3 - Honda (1) - new:R$123600 / used:R$34400 sedan car 2 - Fiat (1) - new:R$1955 / used:R$877 silver sedan car 3 - Cadillac (1) - new:R$174500 / used:R$12999 SUV car"])
regex = re.compile(
r"\d\s-\s(?P<car>.*?)(?:\s\(\d+\)?\s)-\s"
r"new:R\$(?P<new>[\d\.\,]+)\s/\s"
r"used:R\$(?P<used>[\d\.\,]+).*?car"
)
df_out = df[0].str.extractall(regex).stack()
df_out.index = [df_out.index.get_level_values(0),
                df_out.index.map(lambda x: f'{x[2]+str(x[1]+1)}')]
dealers = []
for n, g in df_out.groupby(level=0):
    dealers.append(g.droplevel(0))
df1 = pd.DataFrame(dealers).rename_axis('Dealer')
print(df1)
Output from df1
        car1    new1    used1  car2             new2    used2  car3          new3    used3  car4    new4   used4
Dealer
0       Ford    60000   30000  Mercedes - Benz  130000  95000  Chery (caoa)  80000   60000  Others  90000  75500
1       Toyota  10543   9020   NaN              NaN     NaN    NaN           NaN     NaN    NaN     NaN    NaN
2       Honda   123600  34400  Fiat             1955    877    Cadillac      174500  12999  NaN     NaN    NaN
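As a possible alternative to the loop over groups, the reshaping can also be sketched with unstack on the "match" index level that extractall creates. This is a sketch under assumptions, not the answer's method: the pattern is a simplified version of the regex above, and the names rows, matches and wide are made up.

```python
import pandas as pd

rows = pd.Series([
    "1 - Ford (1) - new:R$60000 / used:R$30000 sedan car "
    "2 - Fiat (1) - new:R$1955 / used:R$877 silver sedan car",
    "2 - Toyota (1) - new:R$10543 / used:R$9020 silver sedan car",
])

pattern = (r"\d+\s-\s(?P<car>.*?)\s\(\d+\)\s-\s"
           r"new:R\$(?P<new>\d+)\s/\sused:R\$(?P<used>\d+)")

# One row per match; the index levels are (original row, match number).
matches = rows.str.extractall(pattern)

# Move the match number into the columns, then flatten ("car", 0) -> "car1".
wide = matches.unstack("match")
wide.columns = [f"{name}{n + 1}" for name, n in wide.columns]
print(wide)
```

Dealers with fewer cars simply get NaN in the unused columns, as in the output above. The columns come out grouped by field (car1, car2, new1, ...) rather than by car.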
Related
I'm running some analysis on bank statements (csv's). Some items like McDonalds each have their own row (due to having different addresses).
I'm trying to combine these rows by a common phrase. So for this example the obvious phrase, or string, would be "McDonalds". I think it'll be an if statement.
Also, the column has a dtype of "object". Will I have to convert it to string format?
Here is an example of the output from printing totali = df.Item.value_counts() in my code.
Ideally I'd want that line to output McDonalds as just a single row.
In the csv they are 2 separate rows.
foo 14
Restaurant Boulder CO 8
McDonalds Boulder CO 5
McDonalds Denver CO 5
Here's what the column data consists of
'Sukiya Greenwood Vil CO' 'Sei 34179 Denver CO' 'Chambers Place Liquors 303-3731100 CO' "Mcdonald's F26593 Fort Collins CO" 'Suh Sushi Korean Bbq Fort Collins CO' 'Conoco - Sei 26927 Fort Collins CO'
OK. I think I ginned up something that can be helpful. Realize that the task of inferring categories or names from text strings can be huge, depending on how detailed you want to get. You can dive into regex or other learning models. People make careers of it! Obviously, your bank is doing some of this as they categorize things when you get a year-end summary.
Anyhow, here is a simple way to generate some categories and use them as a basis for the grouping that you want to do.
import pandas as pd
item=['McDonalds Denver', 'Sonoco', 'ATM Fee', 'Sonoco, Ft. Collins', 'McDonalds, Boulder', 'Arco Boulder']
txn = [12.44, 4.00, 3.00, 14.99, 19.10, 52.99]
df = pd.DataFrame([item, txn]).T
df.columns = ['item_orig', 'charge']
print(df)
# let's add an extra column to catch the conversions...
df['item'] = pd.Series(dtype=str)
# we'll use the "contains" function in pandas as a simple converter... quick demo
temp = df.loc[df['item_orig'].str.contains('McDonalds')]
print('\nitems that containt the string "McDonalds"')
print(temp)
# let's build a simple conversion table in a dictionary
conversions = { 'McDonalds': 'McDonalds - any',
'Sonoco': 'gas',
'Arco': 'gas'}
# let's loop over the orig items and put conversions into the new column
# (there is probably a faster way to do this, but for data with < 100K rows, who cares.)
for key in conversions:
    # assign through a single .loc call; chained df['item'].loc[...] may not write back to df
    df.loc[df['item_orig'].str.contains(key), 'item'] = conversions[key]
# see how we did...
print('converted...')
print(df)
# now move over anything that was NOT converted
# in this example, this is just the ATM Fee item...
df['item'] = df['item'].fillna(df['item_orig'])
# now we have decent labels to support grouping!
print('\n\n *** sum of charges by group ***')
print(df.groupby('item')['charge'].sum())
Yields:
item_orig charge
0 McDonalds Denver 12.44
1 Sonoco 4
2 ATM Fee 3
3 Sonoco, Ft. Collins 14.99
4 McDonalds, Boulder 19.1
5 Arco Boulder 52.99
items that containt the string "McDonalds"
item_orig charge item
0 McDonalds Denver 12.44 NaN
4 McDonalds, Boulder 19.1 NaN
converted...
item_orig charge item
0 McDonalds Denver 12.44 McDonalds - any
1 Sonoco 4 gas
2 ATM Fee 3 NaN
3 Sonoco, Ft. Collins 14.99 gas
4 McDonalds, Boulder 19.1 McDonalds - any
5 Arco Boulder 52.99 gas
*** sum of charges by group ***
item
ATM Fee 3.00
McDonalds - any 31.54
gas 71.98
Name: charge, dtype: float64
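For larger data, the loop over the conversion table could be replaced by a single vectorized pass. The sketch below assumes the same toy data and conversions dictionary as above; the names pattern, key and totals are made up. The .str methods work directly on an object-dtype column, so no conversion to a string dtype is needed first.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "item_orig": ["McDonalds Denver", "Sonoco", "ATM Fee",
                  "Sonoco, Ft. Collins", "McDonalds, Boulder", "Arco Boulder"],
    "charge": [12.44, 4.00, 3.00, 14.99, 19.10, 52.99],
})
conversions = {"McDonalds": "McDonalds - any", "Sonoco": "gas", "Arco": "gas"}

# One capture group that matches any of the dictionary keys.
pattern = "(" + "|".join(re.escape(k) for k in conversions) + ")"

# Which key (if any) each row contains; NaN where nothing matched.
key = df["item_orig"].str.extract(pattern, expand=False)

# Translate keys to labels and fall back to the original text.
df["item"] = key.map(conversions).fillna(df["item_orig"])

totals = df.groupby("item")["charge"].sum()
print(totals)
```

If the real statements mix spellings such as "Mcdonald's" and "McDonalds", the keys would need to be regular expressions or the match made case-insensitive; that is left out here.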
I am trying to generate dummy and categorical variables from a text column in a dataframe, using Python. Imagine a text column 'Cars_notes' in a dataframe named 'Cars_listing':
- "This Audi has ABS braking, leather interior and bucket seats..."
- "The Ford F150 is one tough pickup truck, with 4x4, new suspension and club cab..."
- "Our Nissan Sentra comes with ABS brakes, Bluetooth-enabled radio..."
- "This Toyota Corolla is a gem, with new tires, low miles, a few scratches..."
- "The Renault Le Car has been sitting in the garage, a little rust..."
- "The Kia Sorento for sale has a CD player, new tires..."
- "Red Dodge Viper convertible for sale, ceramic brakes, low miles..."
How to make new variables:
- car_type: American [Ford] (1), European [Audi, Renault] (2), Asian [Toyota, Kia] (3)
- ABS_brakes: description includes 'ABS brak' (1), or not (0)
- imperfection: description includes 'rust' or 'scratches' (1) or not (0)
- sporty: description includes 'convertible' (1) or not (0)
I have started by trying re.search() (not re.match()), such as:
sporty = re.search("convertible",'Cars_notes')
I am just starting to learn Python text manipulation and NLP. I have searched for information here as well as other sources (Data Camp, Udemy, Google searching) but I have not yet found something to explain how to manipulate text to create such categorical or dummy variables. Help will be appreciated. Thanks!
Here's my take on this.
Since you're dealing with text, pandas.Series.str.contains should be plenty (no need to use re.search).
np.where and np.select are useful when it comes to assigning new variables based on conditions.
import pandas as pd
import numpy as np
Cars_listing = pd.DataFrame({
'Cars_notes':
['"This Audi has ABS braking, leather interior and bucket seats..."',
'"The Ford F150 is one tough pickup truck, with 4x4, new suspension and club cab..."',
'"Our Nissan Sentra comes with ABS brakes, Bluetooth-enabled radio..."',
'"This Toyota Corolla is a gem, with new tires, low miles, a few scratches..."',
'"The Renault Le Car has been sitting in the garage, a little rust..."',
'"The Kia Sorento for sale has a CD player, new tires..."',
'"Red Dodge Viper convertible for sale, ceramic brakes, low miles..."']
})
# 1. car_type
Cars_listing['car_type'] = np.select(
condlist=[ # note you could use the case-insensitive search with `case=False`
Cars_listing['Cars_notes'].str.contains('ford', case=False),
Cars_listing['Cars_notes'].str.contains('audi|renault', case=False),
Cars_listing['Cars_notes'].str.contains('Toyota|Kia')
],
choicelist=[1, 2, 3], # dummy variables
default=0 # you could set it to `np.nan` etc
)
# 2. ABS_brakes
Cars_listing['ABS_brakes'] = np.where(# where(condition, [x, y])
Cars_listing['Cars_notes'].str.contains('ABS brak'), 1, 0)
# 3. imperfection
Cars_listing['imperfection'] = np.where(
Cars_listing['Cars_notes'].str.contains('rust|scratches'), 1, 0)
# 4. sporty
Cars_listing['sporty'] = np.where(
Cars_listing['Cars_notes'].str.contains('convertible'), 1, 0)
Cars_notes car_type ABS_brakes imperfection sporty
0 """This Audi has ..." 2 1 0 0
1 """The Ford F150 ..." 1 0 0 0
2 """Our Nissan Sen..." 0 1 0 0
3 """This Toyota Co..." 3 0 1 0
4 """The Renault Le..." 2 0 1 0
5 """The Kia Sorent..." 3 0 0 0
6 """Red Dodge Vipe..." 0 0 0 1
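For the 0/1 flags, a boolean mask can also be cast straight to integers, which avoids repeating np.where for every column. This is only a sketch: flag_patterns is a hypothetical lookup table, and three of the sample notes are used to keep it short.

```python
import pandas as pd

notes = pd.Series([
    "This Audi has ABS braking, leather interior and bucket seats...",
    "This Toyota Corolla is a gem, with new tires, low miles, a few scratches...",
    "Red Dodge Viper convertible for sale, ceramic brakes, low miles...",
])

# Hypothetical lookup table: flag name -> regex to search for.
flag_patterns = {
    "ABS_brakes": "ABS brak",
    "imperfection": "rust|scratches",
    "sporty": "convertible",
}

# str.contains returns booleans; astype(int) turns them into 1/0 dummies.
flags = pd.DataFrame({
    name: notes.str.contains(pattern, case=False).astype(int)
    for name, pattern in flag_patterns.items()
})
print(flags)
```

Note that a bare pattern such as "rust" would also match inside words like "trust"; adding word boundaries (\b) is one way to tighten it if that matters for the data.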
I'm trying to make a function that prints the top 5 products and their prices, and the bottom 5 products and their prices of the product listings that contain words from a wordlist. I've tried making it like this -
def wordlist_top_costs(filename, wordlist):
    xlsfile = pd.ExcelFile(filename)
    dframe = xlsfile.parse('Sheet1')
    dframe['Product'].fillna('', inplace=True)
    dframe['Price'].fillna(0, inplace=True)
    price = {}
    for word in wordlist:
        mask = dframe.Product.str.contains(word, case=False, na=False)
        price[mask] = dframe.loc[mask, 'Price']
    top = sorted(Score.items(), key=operator.itemgetter(1), reverse=True)
    print("Top 10 product prices for: ", wordlist.name)
    for i in range(0, 5):
        print(top[i][0], " | ", t[i][1])
    bottom = sorted(Score.items(), key=operator.itemgetter(1), reverse=False)
    print("Bottom 10 product prices for: ", wordlist.name)
    for i in range(0, 5):
        print(top[i][0], " | ", t[i][1])
However, the above function throws an error at the line
price[mask] = dframe.loc[mask, 'Price'] that says -
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Any help to correct/modify this appreciated. Thanks!
Edit -
For eg.
wordlist - alu, co, vin
Product | Price
Aluminium Crown - 22.20
Coca Cola - 1.0
Brass Box - 28.75
Vincent Kettle - 12.00
Vinyl Stickers - 0.50
Doritos - 2.0
Colin's Hair Oil - 5.0
Vincent Chase Sunglasses - 75.40
American Tourister - $120.90
Output :
Top 3 Product Prices:
Vincent Chase Sunglasses - 75.40
Aluminium Crown - 22.20
Vincent Kettle - 12.0
Bottom 3 Product Prices:
Vinyl Stickers - 0.50
Coca Cola - 1.0
Colin's Hair Oil - 5.0
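You can use nlargest and nsmallest:
#remove $ and convert column Price to floats
#(regex=False, so that '$' is treated as a literal character and not as a regex anchor)
dframe['Price'] = dframe['Price'].str.replace('$', '', regex=False).astype(float)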
#filter by regex - joined all values of list by |
wordlist = ['alu', 'co', 'vin']
pat = '|'.join(wordlist)
mask = dframe.Product.str.contains(pat, case=False, na=False)
dframe = dframe.loc[mask, ['Product','Price']]
top = dframe.nlargest(3, 'Price')
#top = dframe.sort_values('Price', ascending=False).head(3)
print (top)
Product Price
7 Vincent Chase Sunglasses 75.4
0 Aluminium Crown 22.2
3 Vincent Kettle 12.0
bottom = dframe.nsmallest(3, 'Price')
#bottom = dframe.sort_values('Price').head(3)
print (bottom)
Product Price
4 Vinyl Stickers 0.5
1 Coca Cola 1.0
6 Colin's Hair Oil 5.0
Setup:
dframe = pd.DataFrame({'Price': ['22.20', '1.0', '28.75', '12.00', '0.50', '2.0', '5.0', '75.40', '$120.90'], 'Product': ['Aluminium Crown', 'Coca Cola', 'Brass Box', 'Vincent Kettle', 'Vinyl Stickers', 'Doritos', "Colin's Hair Oil", 'Vincent Chase Sunglasses', 'American Tourister']}, columns=['Product','Price'])
print (dframe)
Product Price
0 Aluminium Crown 22.20
1 Coca Cola 1.0
2 Brass Box 28.75
3 Vincent Kettle 12.00
4 Vinyl Stickers 0.50
5 Doritos 2.0
6 Colin's Hair Oil 5.0
7 Vincent Chase Sunglasses 75.40
8 American Tourister $120.90
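The TypeError in the question comes from price[mask]: mask is a boolean Series, and a Series cannot be used as a dictionary key. Putting the steps of this answer into one function avoids the dictionary entirely. The sketch below assumes the prices are already numeric; the function name top_and_bottom and its parameter n are made up.

```python
import re
import pandas as pd

def top_and_bottom(dframe, wordlist, n=3):
    """Return the n most and n least expensive products matching any word."""
    # Join the words into one alternation; re.escape keeps them literal.
    pattern = "|".join(re.escape(word) for word in wordlist)
    mask = dframe["Product"].str.contains(pattern, case=False, na=False)
    matched = dframe.loc[mask, ["Product", "Price"]]
    return matched.nlargest(n, "Price"), matched.nsmallest(n, "Price")

products = pd.DataFrame({
    "Product": ["Aluminium Crown", "Coca Cola", "Brass Box", "Vincent Kettle",
                "Vinyl Stickers", "Doritos", "Colin's Hair Oil",
                "Vincent Chase Sunglasses", "American Tourister"],
    "Price": [22.20, 1.0, 28.75, 12.00, 0.50, 2.0, 5.0, 75.40, 120.90],
})

top, bottom = top_and_bottom(products, ["alu", "co", "vin"])
print(top)
print(bottom)
```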
I have a dataframe topic_data that contains the output of an LDA topic model:
topic_data.head(15)
topic word score
0 0 Automobile 0.063986
1 0 Vehicle 0.017457
2 0 Horsepower 0.015675
3 0 Engine 0.014857
4 0 Bicycle 0.013919
5 1 Sport 0.032938
6 1 Association_football 0.025324
7 1 Basketball 0.020949
8 1 Baseball 0.016935
9 1 National_Football_League 0.016597
10 2 Japan 0.051454
11 2 Beer 0.032839
12 2 Alcohol 0.027909
13 2 Drink 0.019494
14 2 Vodka 0.017908
This shows the top 5 terms for each topic, and the score (weight) for each. What I'm trying to do is reformat so that the index is the rank of the term, the columns are the topic IDs, and the values are formatted strings generated from the word and score columns (something along the lines of "%s (%.02f)" % (word,score)). That means the new dataframe should look something like this:
Topic 0 1 ...
Rank
0 Automobile (0.06) Sport (0.03) ...
1 Vehicle (0.017) Association_football (0.03) ...
... ... ... ...
What's the right way of going about this? I assume it involves a combination of index-setting, unstacking, and ranking, but I'm not sure of the right approach.
It would be something like this, note that Rank has to be generated first:
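In [140]:
# df is the question's topic_data; rank terms within each topic by descending score (0 = highest)
df['Rank'] = df.groupby('topic')['score'].rank(ascending=False, method='first').astype(int) - 1
df['New_str'] = df['word'] + df['score'].apply(' ({0:.2f})'.format)
df2 = df.sort_values(['Rank', 'score'])[['New_str', 'topic', 'Rank']]
print(df2.pivot(index='Rank', values='New_str', columns='topic'))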
topic 0 1 2
Rank
0 Automobile (0.06) Sport (0.03) Japan (0.05)
1 Vehicle (0.02) Association_football (0.03) Beer (0.03)
2 Horsepower (0.02) Basketball (0.02) Alcohol (0.03)
3 Engine (0.01) Baseball (0.02) Drink (0.02)
4 Bicycle (0.01) National_Football_League (0.02) Vodka (0.02)
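For anyone who wants to try the reshaping without the full topic model output, here is a small self-contained sketch on two topics. It takes a slightly different route to the rank (sort by score, then number the rows within each topic with cumcount); the names ordered, label and table are made up.

```python
import pandas as pd

topic_data = pd.DataFrame({
    "topic": [0, 0, 1, 1],
    "word": ["Automobile", "Vehicle", "Sport", "Association_football"],
    "score": [0.063986, 0.017457, 0.032938, 0.025324],
})

# Highest score first within each topic, then number the rows 0, 1, ...
ordered = topic_data.sort_values(["topic", "score"], ascending=[True, False])
ordered["Rank"] = ordered.groupby("topic").cumcount()

# Build the "word (score)" label and spread the topics across the columns.
ordered["label"] = ordered["word"] + ordered["score"].map(" ({:.2f})".format)
table = ordered.pivot(index="Rank", columns="topic", values="label")
print(table)
```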
I am new to Python, so please excuse my question. In my line of work I have to work with tabular data stored in text files. The values are separated by either a comma or a semicolon. A simplified example of such a file might look as follows:
City;Car model;Color;Registration number
Moscow;Mercedes;Red;1234
Moscow;Mercedes;Red;2345
Kiev;Toyota;Blue;3423
London;Fiat;Red;4545
My goal is to have a script which can tell me how many Mercedes are in Moscow (in our case there are two) and save a new text file Moscow.txt with the following:
Moscow;Mercedes;Red;1234
Moscow;Mercedes;Red;2345
I will be very thankful for your help.
I would recommend looking into the pandas library. You can do all sorts of neat manipulations of tabular data. First read it in:
>>> import pandas as pd
>>> df = pd.read_csv("cars.ssv", sep=";")
>>> df
City Car model Color Registration number
0 Moscow Mercedes Red 1234
1 Moscow Mercedes Red 2345
2 Kiev Toyota Blue 3423
3 London Fiat Red 4545
Index it in different ways:
>>> moscmerc = df[(df["City"] == "Moscow") & (df["Car model"] == "Mercedes")]
>>> moscmerc
City Car model Color Registration number
0 Moscow Mercedes Red 1234
1 Moscow Mercedes Red 2345
>>> len(moscmerc)
2
Write it out:
>>> moscmerc.to_csv("moscmerc.ssv", sep=";", header=False, index=False)
>>> !cat moscmerc.ssv
Moscow;Mercedes;Red;1234
Moscow;Mercedes;Red;2345
You can also work on multiple groups at once:
>>> df.groupby(["City", "Car model"]).size()
City Car model
Kiev Toyota 1
London Fiat 1
Moscow Mercedes 2
dtype: int64
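Since the question asks for a file named after the city, the same groupby idea can be sketched to write one file per city. The example reads the sample data from a string and writes into a temporary directory; out_dir and the file naming are assumptions for illustration only.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

raw = """City;Car model;Color;Registration number
Moscow;Mercedes;Red;1234
Moscow;Mercedes;Red;2345
Kiev;Toyota;Blue;3423
London;Fiat;Red;4545
"""
df = pd.read_csv(io.StringIO(raw), sep=";")

# Hypothetical output location; replace with a real folder as needed.
out_dir = Path(tempfile.mkdtemp())

# One file per city, e.g. Moscow.txt, without header or index column.
for city, group in df.groupby("City"):
    group.to_csv(out_dir / f"{city}.txt", sep=";", header=False, index=False)

moscow_text = (out_dir / "Moscow.txt").read_text()
print(moscow_text)
```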
Update: @Anthon pointed out that the above only handles the case of a semicolon separator. If a file has a comma throughout, then you can just use , instead of ;, so that's trivial. The more interesting case is if the delimiter is inconsistent within the file, but that's easily handled too:
>>> !cat cars_with_both.txt
City;Car model,Color;Registration number
Moscow,Mercedes;Red;1234
Moscow;Mercedes;Red;2345
Kiev,Toyota;Blue,3423
London;Fiat,Red;4545
>>> df = pd.read_csv("cars_with_both.txt", sep="[;,]", engine="python")
>>> df
City Car model Color Registration number
0 Moscow Mercedes Red 1234
1 Moscow Mercedes Red 2345
2 Kiev Toyota Blue 3423
3 London Fiat Red 4545
Update #2: and now the text is in Russian -- of course it is. :^) Still, if everything is correctly encoded, and your terminal is properly configured, that should work too:
>>> df = pd.read_csv("russian_cars.csv", sep="[;,]", engine="python")
>>> df
City Car model Color Registration number
0 Москва Mercedes красный 1234
1 Москва Mercedes красный 2345
2 Киев Toyota синий 3423
3 Лондон Fiat красный 4545
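To check the mixed-delimiter and encoding cases together, a small round trip can be sketched as follows. It assumes the file is UTF-8 encoded and uses a temporary file instead of russian_cars.csv; the regex separator is handled by the Python parsing engine, which is requested explicitly here.

```python
import tempfile
from pathlib import Path

import pandas as pd

raw = (
    "City;Car model,Color;Registration number\n"
    "Москва,Mercedes;красный;1234\n"
    "Киев;Toyota,синий;3423\n"
)

# Write the sample as UTF-8 so the encoding is known when reading it back.
path = Path(tempfile.mkdtemp()) / "cars_mixed.csv"
path.write_text(raw, encoding="utf-8")

# "[;,]" is a regular expression: split on either a semicolon or a comma.
df = pd.read_csv(path, sep="[;,]", engine="python", encoding="utf-8")
print(df)
```

If the file comes from another system and the text appears garbled, the encoding argument is the first thing to adjust (for example to a Windows code page), since the separator handling stays the same.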