Extract the first number from a string number range - python

I have a dataset whose price column is of type string, and some of the values are ranges (e.g. 15000-20000).
I want to extract the first number and convert the entire column to integers.
I tried this :
df['price'].apply(lambda x: x.split('-')[0])
The code just returns the original column.

Try one of the following options:
Data
import pandas as pd
data = {'price': ['0','100-200','200-300']}
df = pd.DataFrame(data)
print(df)
     price
0        0  # a value without `-`, to show that it is handled too
1  100-200
2  200-300
Option 1
Use Series.str.split with expand=True and select the first column from the result.
Next, chain Series.astype, and assign the result to df['price'] to overwrite the original values.
df['price'] = df.price.str.split('-', expand=True)[0].astype(int)
print(df)
price
0 0
1 100
2 200
Option 2
Use Series.str.extract with a regex pattern, r'(\d+)-?':
\d matches a digit.
+ matches the preceding digit one or more times.
-? optionally matches a trailing -, so the match stops there when a range separator follows (? means "zero or one occurrence").
data = {'price': ['0','100-200','200-300']}
df = pd.DataFrame(data)
df['price'] = df.price.str.extract(r'(\d+)-?').astype(int)
# same result
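If the column can also contain malformed strings or missing values, a slightly more defensive sketch (the 'n/a' value is an assumption added for illustration; it is not in the question's data) uses pd.to_numeric together with pandas' nullable Int64 dtype, so rows without digits become <NA> instead of raising:

```python
import pandas as pd

data = {'price': ['0', '100-200', '200-300', 'n/a']}
df = pd.DataFrame(data)

# extract the leading run of digits; rows with no digits yield NaN,
# which the nullable Int64 dtype preserves as <NA>
df['price'] = pd.to_numeric(df['price'].str.extract(r'(\d+)')[0]).astype('Int64')
print(df)
```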

Here is one way to do this:
df['price'] = df['price'].str.split('-', expand=True)[0].astype('int')
This will store only the first number from the range. For example, from 15000-20000 only 15000 will be stored in the price column.

Related

How to transform number that include alphabet to a new numerical order in pandas columns?

I am trying to convert some columns that include number and alphabet at the same time.
I want to convert to simple numerical value such as int.
I don't want to just transform randomly because they are isbn involved. I want to able to match again.
I want to transform that matching as below.
d = {'id': [156972, 102154, 214717, 84897, 275220, 165759, 42099, 265749, 130474, 15822],
'isbn': ['2842630521', '037570745X','689710879','783547528','786014091','1561581747','553571095','451210220','034540288X','345832710X',]}
df = pd.DataFrame(data=d)
I want to transform this way.
for id
156972 as 0
102154 as 1
.
.
.
130474 as 8
15822 as 9
for isbn
2842630521 as 0
037570745X as 1 # alphabet involved
.
.
.
034540288X as 8
345832710X as 9
Is there anyway to do this to column?
Optional: I would also like a way to track back if the number 0 in isbn column to 2842630521.
Use:
# randomly shuffle the rows
df1 = df.sample(frac=1)
# create a mapping dictionary from the id values
a, b = pd.factorize(df1['id'])
d = dict(zip(b, a))
# map the original column through the dictionary
df['id'] = df['id'].map(d)
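The same pd.factorize trick works for the isbn column, and keeping the array of uniques it returns also answers the optional part of the question: it lets you map a code back to the original isbn. A minimal sketch (with a shortened data set for brevity):

```python
import pandas as pd

df = pd.DataFrame({'isbn': ['2842630521', '037570745X', '689710879']})

# factorize returns integer codes plus the unique values,
# in order of first appearance
codes, uniques = pd.factorize(df['isbn'])
df['isbn_code'] = codes

# reverse lookup: integer code back to the original isbn string
reverse = dict(enumerate(uniques))
print(reverse[0])  # '2842630521'
```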

How to convert rows into columns (as value but not header) in Python

In the following dataset, I need to convert each row of the "description" for each value in the "name" column (for example, inventory1, inventory2 and inventory3) into two separate columns (namely description1 and description2, respectively). If I use either pivot_table or groupby, the value of the description becomes a header instead of a value under a column. What would be the way to generate the desired output? Thanks
import pandas as pd
df1 = {'item': ['item1','item2','item3','item4','item5','item6'],
       'name': ['inventory1','inventory1','inventory2','inventory2','inventory3','inventory3'],
       'code': [1,1,2,2,3,3],
       'description': ['sales number decrease compared to last month',
                       'Sales number decreased',
                       'sales number increased',
                       'Sales number increased, need to keep kpi',
                       'no sales this month',
                       'item out of stock']}
df1 = pd.DataFrame(df1)
desired output as below:
You can actually use pd.concat:
new_df = pd.concat([
    (
        df1.drop_duplicates('name')
           .drop('description', axis=1)
           .reset_index(drop=True)
    ),
    (
        pd.DataFrame([pd.Series(l) for l in df1.groupby('name')['description'].agg(list).tolist()])
          .add_prefix('description')
    ),
], axis=1)
Output:
>>> new_df
item name code description0 description1
0 item1 inventory1 1 sales number decrease compared to last month Sales number decreased
1 item3 inventory2 2 sales number increased Sales number increased, need to keep kpi
2 item5 inventory3 3 no sales this month item out of stock
One-liner version of the above, in case you want it:
pd.concat([df1.drop_duplicates('name').drop('description', axis=1).reset_index(drop=True), pd.DataFrame([pd.Series(l) for l in df1.groupby('name')['description'].agg(list).tolist()]).add_prefix('description')], axis=1)
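An alternative sketch that avoids building a list of Series by hand: number the descriptions within each name with cumcount, then pivot (the data here is shortened, and the intermediate column name n is an assumption, not from the question):

```python
import pandas as pd

df1 = pd.DataFrame({
    'name': ['inventory1', 'inventory1', 'inventory2', 'inventory2'],
    'description': ['desc a', 'desc b', 'desc c', 'desc d'],
})

wide = (
    df1.assign(n=df1.groupby('name').cumcount())   # 0, 1 within each name
       .pivot(index='name', columns='n', values='description')
       .add_prefix('description')                  # 0 -> 'description0', ...
       .reset_index()
)
print(wide)
```

The remaining columns (item, code) could then be attached with the same drop_duplicates/concat step shown above.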

Splitting row values and count unique's from a DataFrame

I have the following data in a column titled Reference:
ABS052
ABS052/01
ABS052/02
ADA010/00
ADD005
ADD005/01
ADD005/02
ADD005/03
ADD005/04
ADD005/05
...
WOO032
WOO032/01
WOO032/02
WOO032/03
WOO045
WOO045/01
WOO045/02
WOO045/03
WOO045/04
I would like to know how to split the row values to create a Dataframe that contains the single Reference code, plus a Count value, for example:
Reference  Count
ABS052     3
ADA010     0
ADD005     2
...        ...
WOO032     3
WOO045     4
I have the following code:
df['Reference'] = df['Reference'].str.split('/')
Results in:
['ABS052'],
['ABS052','01'],
['ABS052','02'],
['ABS052','03'],
...
But I'm not sure how to ditch the last two digits from the list in each row.
All I want now is to keep element [0] of the list in each row, if that makes sense; then I could just get a value_counts from the 'Reference' column.
There seems to be something wrong with the expected result listed in the question.
Let's say you want to ditch the digits and count the prefix occurrences:
df.Reference.str.split("/", expand=True)[0].value_counts()
If instead the suffix means something and you want to keep the highest value, this should do:
df.Reference.str.split("/", expand=True).fillna("00").astype({0: str, 1: int}).groupby(0).max()
You can just use a regex to strip the /NN suffix like this:
df = pd.DataFrame({'a':['ABS052','ABS052/01','ABS052/02','ADA010/00','ADD005','ADD005/01','ADD005/02','ADD005/03','ADD005/04','ADD005/05']})
df = df['a'].str.replace(r'/\d+$', '', regex=True).value_counts().reset_index()
Output:
>>> df
    index  a
0  ADD005  6
1  ABS052  3
2  ADA010  1
You are almost there, you can add expand=True to split and then use groupby:
df['Reference'].str.split("/", expand=True).fillna("--").groupby(0).count()
returns:
         1
0
ABS052   3
ADA010   1
ADD005   6
for the first couple of rows of your data.
The fillna("--") makes sure you also count lines like ABS052 without a "/", i.e. None in the second column.
Output to df with column names
df['Reference'] = df['Reference'].str.split('/').str[0]
df_counts = df['Reference'].value_counts().rename_axis('Reference').reset_index(name='Counts')
output
Reference Counts
0 ADD005 6
1 ABS052 3
2 ADA010 1
Explanation - The first line gives a clean series called 'Reference'. The second line counts the unique items, then resets the index and renames the columns.
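Given the ambiguity the first answer points out, here is a sketch of the other plausible reading: counting only the rows that actually carry a /NN suffix, so a bare ABS052 row contributes nothing (the data is shortened for illustration):

```python
import pandas as pd

refs = pd.Series(['ABS052', 'ABS052/01', 'ABS052/02',
                  'ADA010/00', 'ADD005', 'ADD005/01'])

parts = refs.str.split('/', expand=True)
# keep only rows that have a suffix, then count per prefix
counts = parts[parts[1].notna()].groupby(0).size()
print(counts)
```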

How to display grouped variable count in each row

I want to show the count of a given rating for a given month in each rating's row for the data below. This would mean that on rows 0 and 3 there would be a 2 since there are two 10 ratings given in month 1.
test = {'Rating': [10,9,8,10,8,6,4,3,0,7,2,5], 'Month': [1,2,3,1,3,2,1,2,3,1,2,3]}
test_df = pd.DataFrame(data=test)
I have tried the following but it didn't help much:
test_df['Rating_totals'] = test_df.groupby(['Month'])['Rating'].count()
Is there a way to do this?
Use value_counts():
test_df.value_counts()
To sort the results by month and rating:
total_rankings = test_df[['Month', 'Rating']].value_counts().sort_index()
Use pandas apply to add the total count for each row as a new column to test_df:
test_df['total_rankings'] = test_df.apply(lambda row: total_rankings.loc[row['Month'], row['Rating']], axis=1)
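A more direct alternative (a sketch using groupby/transform rather than the value_counts lookup above) computes the per-(Month, Rating) count already aligned with the original rows, in one step:

```python
import pandas as pd

test_df = pd.DataFrame({'Rating': [10, 9, 8, 10, 8, 6, 4, 3, 0, 7, 2, 5],
                        'Month':  [1, 2, 3, 1, 3, 2, 1, 2, 3, 1, 2, 3]})

# transform('size') broadcasts each group's row count back to every row,
# so rows 0 and 3 (Month 1, Rating 10) both get 2
test_df['Rating_totals'] = test_df.groupby(['Month', 'Rating'])['Rating'].transform('size')
print(test_df)
```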

Python - str.match for each string in a dataframe

I'm trying to use str.match to match a phrase exactly, for each word in each row's string. I want to return the index of the matching row, which is why I'm using str.match instead of a plain regex search.
I want to return the index for the row that contains exactly 'FL', not 'FLORIDA'. The problem with using str.contains though, is that it returns to me the index of the row with 'FLORIDA'.
import pandas as pd
data = [['Alex in FL','ten'],['Bob in FLORIDA','five'],['Will in GA','three']]
df = pd.DataFrame(data,columns=['Name','Age'])
df.index[df['Name'].str.contains('FL')]
df.index[df['Name'].str.match('FL')]
Here's what the dataframe looks like:
Name Age
0 Alex in FL ten
1 Bob in FLORIDA five
2 Will in GA three
The output should be returning the index of row 0:
Int64Index([0], dtype='int64')
Use contains with word boundaries:
import pandas as pd
data = [['Alex in FL','ten'],['Bob in FLORIDA','five'],['Will in GA','three']]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df.index[df['Name'].str.contains(r'\bFL\b')])
Output
Int64Index([0], dtype='int64')
Try:
df[df.Name.str.contains(r'\bFL\b', regex=True)]
OR
df[['FL' in i for i in df.Name.str.split()]]
Output:
Name Age
0 Alex in FL ten
The docs say that str.contains matches a regex pattern ("FL" in your case) anywhere in the string. Since "FLORIDA" contains that substring, it matches too.
One way around this is to search for " FL " (padded with spaces), but then you also need to pad each value with spaces (for the cases where "FL" is at the start or end of the string).
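The padding idea above can be sketched like this (equivalent in effect to the \bFL\b answers; shown only to illustrate the suggestion):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alex in FL', 'Bob in FLORIDA', 'Will in GA']})

# pad both the values and the search term with spaces, so 'FL' only
# matches as a whole word, including at the start or end of a string
padded = ' ' + df['Name'] + ' '
idx = df.index[padded.str.contains(' FL ', regex=False)]
print(idx)
```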
