How to parse all the values in a column of a DataFrame? - python

DataFrame df has a column called Amount:
import pandas as pd
df = pd.DataFrame(['$3,000,000.00', '$3,000.00', '$200.5', '$5.5'], columns=['Amount'])
df:
ID | Amount
0 | $3,000,000.00
1 | $3,000.00
2 | $200.5
3 | $5.5
I want to parse all the values in the Amount column, extract each amount as a number, and drop the decimal part. The end result is a DataFrame that looks like this:
ID | Amount
0 | 3000000
1 | 3000
2 | 200
3 | 5
How do I do this?

You can use str.replace to strip the $ and the commas (pass regex=True explicitly, since pandas 2.0 no longer treats the pattern as a regex by default), then double-cast with astype, first to float and then to int:
df['Amount'] = df.Amount.str.replace(r'[\$,]', '', regex=True).astype(float).astype(int)
print(df)
Amount
0 3000000
1 3000
2 200
3 5
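As a variant (a sketch, not from the original answer), pd.to_numeric with errors='coerce' does the same conversion but turns unparseable cells into NaN instead of raising, which can be handy on messier data:
import pandas as pd

df = pd.DataFrame(['$3,000,000.00', '$3,000.00', '$200.5', '$5.5'], columns=['Amount'])

# strip '$' and ',' then convert; bad cells would become NaN rather than raising
cleaned = df['Amount'].str.replace(r'[\$,]', '', regex=True)
# no NaNs here, so plain int is safe; with NaNs you'd use astype('Int64') instead
df['Amount'] = pd.to_numeric(cleaned, errors='coerce').astype(int)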

You need to use the map function on the column and reassign to the same column:
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
df.Amount = df.Amount.map(lambda s: int(locale.atof(s[1:])))
PS: This uses the code from How do I use Python to convert a string to a number if it has commas in it as thousands separators? to convert a string representing a number with a thousands separator to an int.
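One caveat: locale names vary by platform, and setlocale raises locale.Error if 'en_US.UTF-8' is not installed. A small fallback guard (a sketch, not part of the original answer):
import locale

# 'en_US.UTF-8' may not exist on every system; fall back to the user's default
try:
    locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
except locale.Error:
    locale.setlocale(locale.LC_ALL, '')  # system default locale
# works only if the active locale uses ',' as the thousands separator
print(locale.atof('3,000,000.00'))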

Code -
import pandas as pd
def format_amount(x):
    x = x[1:].split('.')[0]            # drop the leading '$' and the decimal part
    return int(''.join(x.split(',')))  # strip the thousands separators

df = pd.DataFrame(['$3,000,000.00', '$3,000.00', '$200.5', '$5.5'], columns=['Amount'])
df['Amount'] = df['Amount'].apply(format_amount)
print(df)
Output -
Amount
0 3000000
1 3000
2 200
3 5

Related

Python Dataframe: a str has numbers and letters, I want to remove the letters and multiply the remaining numbers by 1,000,000 [duplicate]

This question already has answers here: Convert the string 2.90K to 2900 or 5.2M to 5200000 in pandas dataframe
I have a dataframe that contains values like:
|column a|
---------
|3.5M+ |
|100,000 |
|214,123 |
|1.25M+ |
I want to convert values like 3.5M+ to 3,500,000
I've tried:
regex1 = r'.+M+'
for i in df.a:
    b = re.match(regex1, i)
    if b is not None:
        i = int(np.double(b.string.removesuffix('M+'))*1000000)
    else:
        i = i.replace(',', '')
If I add print statements throughout, it looks like it's iterating correctly. Unfortunately, the changes are not saved to the dataframe (rebinding the loop variable i never writes back to df).
>>> import pandas as pd
>>> df = pd.DataFrame({'column_a' : ['3.5M+', '100,000', '214,123', '1.25M+']})
>>> df
column_a
0 3.5M+
1 100,000
2 214,123
3 1.25M+
>>> df.column_a = df.column_a.str.replace(r"M\+", '*1000000', regex=True).str.replace(",", '', regex=False).apply(eval)
>>> df
column_a
0 3500000.0
1 100000.0
2 214123.0
3 1250000.0
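As an aside (a sketch, not from the original answer): apply(eval) runs Python code per row; the same conversion can be done vectorized, assuming pandas >= 1.4 for str.removesuffix:
import pandas as pd

df = pd.DataFrame({'column_a': ['3.5M+', '100,000', '214,123', '1.25M+']})

# scale rows that carried the 'M+' suffix by a million, leave the rest as-is
is_millions = df['column_a'].str.endswith('M+')
numbers = (df['column_a']
           .str.removesuffix('M+')              # pandas >= 1.4
           .str.replace(',', '', regex=False)
           .astype(float))
df['column_a'] = (numbers * is_millions.map({True: 1_000_000, False: 1})).astype(int)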

multiple columns to single datetime dataframe column

I have a data frame that contains (among others) columns for the time of day (00:00-23:59:59), day (1-7), month (1-12), and year (2000-2019). How can I combine the values of each of these columns on a row-by-row basis into a new DateTime object, and then store these new date-times in a new column? I've read other posts pertaining to such a task, but they all seem to involve converting one date column to one DateTime column, whereas I have 4 columns that need to be transformed into a DateTime. Any help is appreciated!
e.g.
| 4:30:59 | 1 | 1 | 2000 | TO 2000/1/1 4:30:59
This is the only code I have so far, which probably doesn't do anything:
#creating datetime object (MISC)
data = pd.read_csv('road_accidents_data_clean.csv',delimiter=',')
df = pd.DataFrame(data)
format = '%Y-%m-%d %H:%M:%S'
n = 0
df['datetime'] = data.loc[n,'Crash_Day'],data.loc[n,'Crash_Month'],data.loc[n,'Year']
My DataFrame is laid out as follows:
Index | Age | Year | Crash_Month | Crash_Day | Crash_Time | Road_User | Gender |
0 37 2000 1 1 4:30:59 DRIVER MALE
1 42 2000 1 1 7:45:10 DRIVER MALE
2 25 2000 1 1 10:15:30 PEDESTRIAN FEMALE
Crash_Type | Injury_Severity | Crash_LGA | Crash_Area_Type | Datetime |
UNKNOWN 1 YARRA MELBOURNE NaN
OVERTAKING 1 YARRA MELBOURNE NaN
ADJACENT DIR 0 MONASH MELBOURNE NaN
NOTE: the dataframe is 13 columns wide i just couldn't fit them all on one line so Crash_Type starts to the right of Gender.
Below is the code I've been suggested to use, and my adaptation of it:
df = pd.DataFrame(dict(
    Crash_Time=['4:30:59','4:20:00'],
    Crash_Day=[1,20],
    Crash_Month=[1,4],
    Year=[2000,2020],
))
data['Datetime'] = df['Datetime'] = pd.to_datetime(
    np.sum([
        df['Year'].astype(str),
        '-',
        df['Crash_Month'].astype(str),
        '-',
        df['Crash_Day'].astype(str),
        ' ',
        df['Crash_Time'],
    ]),
    format='%Y-%m-%d %H:%M:%S',
)
I've adapted this code in order to combine the values for the datetime column with my original dataframe.
Combine the columns into a single series of strings using + (converting to str where needed with the pandas.Series.astype method), then pass that new series into pd.to_datetime before assigning it to a new column in your df:
import pandas as pd
df = pd.DataFrame(dict(time=['4:30:59'],date=[1],month=[1],year=[2000]))
df['datetime'] = pd.to_datetime(
    df['year'].astype(str)+'-'+df['month'].astype(str)+'-'+df['date'].astype(str)+' '+df['time'],
    format='%Y-%m-%d %H:%M:%S',
)
print(df)
example in python tutor
edit: You can also use numpy.sum to make that one long line of added-together columns easier on the eyes:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(
    time=['4:30:59','4:20:00'],
    date=[1,20],
    month=[1,4],
    year=[2000,2020],
))
df['datetime'] = pd.to_datetime(
    np.sum([
        df['year'].astype(str),
        '-',
        df['month'].astype(str),
        '-',
        df['date'].astype(str),
        ' ',
        df['time'],
    ]),
    format='%Y-%m-%d %H:%M:%S',
)
sum example in python tutor
edit 2: Using your actual column names, it should be something like this:
import pandas as pd
import numpy as np
'''
Index | Age | Year | Crash_Month | Crash_Day | Crash_Time | Road_User | Gender |
0 37 2000 1 1 4:30:59 DRIVER MALE
Crash_Type | Injury_Severity | Crash_LGA | Crash_Area_Type | Datetime |
UNKNOWN 1 YARRA MELBOURNE NaN
'''
df = pd.DataFrame(dict(
    Crash_Time=['4:30:59','4:20:00'],
    Crash_Day=[1,20],
    Crash_Month=[1,4],
    Year=[2000,2020],
))
df['Datetime'] = pd.to_datetime(
    np.sum([
        df['Year'].astype(str),
        '-',
        df['Crash_Month'].astype(str),
        '-',
        df['Crash_Day'].astype(str),
        ' ',
        df['Crash_Time'],
    ]),
    format='%Y-%m-%d %H:%M:%S',
)
print(df)
another python tutor link
One thing to note: you might want to double-check whether your csv file is separated by just a comma, or by a comma and a space. You may need to load the data with df = pd.read_csv('road_accidents_data_clean.csv', sep=', ') if there is an extra space separating the data in addition to the comma; you don't want that extra space in your data.
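A side note (a sketch, not from the answer above): pd.to_datetime can also assemble dates directly from a DataFrame whose columns are named year/month/day, with the time-of-day string added on as a timedelta:
import pandas as pd

df = pd.DataFrame(dict(
    Crash_Time=['4:30:59', '4:20:00'],
    Crash_Day=[1, 20],
    Crash_Month=[1, 4],
    Year=[2000, 2020],
))

# to_datetime assembles a date from columns named 'year'/'month'/'day';
# to_timedelta parses 'H:MM:SS' strings as the time of day
date_parts = df[['Year', 'Crash_Month', 'Crash_Day']].rename(
    columns={'Year': 'year', 'Crash_Month': 'month', 'Crash_Day': 'day'})
df['Datetime'] = pd.to_datetime(date_parts) + pd.to_timedelta(df['Crash_Time'])
print(df)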

pandas: how to groupby / pivot retaining the NaNs? Converting float to str then back to float works but seems convoluted

I am tracking in which "month" a certain event has taken place. If it hasn't, the "month" field is a NaN. The starting table looks like this:
+-------+----------+---------+
| Month | Category | Balance |
+-------+----------+---------+
| 1 | a | 100 |
| nan | a | 300 |
| 2 | a | 200 |
+-------+----------+---------+
I am trying to build a crosstab like this:
+-------+----------------------------------+
| Month | Category a - cumulative % amount |
+-------+----------------------------------+
| 1 | 0.16 |
| 2 | 0.50 |
+-------+----------------------------------+
In month 1, the event has happened for 100/600, i.e. for 16%.
In month 2, the event has happened, cumulatively, for (100 + 200) / 600 = 50%, where 100 is in month 1 and 200 in month 2.
My issue is with NaNs. Pandas automatically removes NaNs from any groupby / pivot / crosstab. I could convert the month field to string, so that grouping it won't remove the NaNs, but then pandas sorts the months as if they were strings, i.e. it would sort: 10, 48, 5, 6.
Any suggestions?
The following works but seems extremely convoluted:
1. Convert "month" to string
2. Do a crosstab
3. Convert "month" back to float (can I do it without moving the index to a column, and then the column back to the index?)
4. Sort again
5. Do the cumsum
Code:
import numpy as np
import pandas as pd
df = pd.DataFrame()
mylen = int(10e3)
df['ix'] = np.arange(0,mylen)
df['amount'] = np.random.uniform(10e3,20e3,mylen)
df['category'] = np.where(df['ix'] <= 4000, 'a', 'b')
df['month'] = np.random.uniform(3, 48, mylen)
df['month'] = np.where(df['ix'] <= 1000, np.nan, df['month'])
df['month rounded'] = np.ceil(df['month'])
ct = pd.crosstab(df['month rounded'].astype(str), df['category'],
                 values=df['amount'], aggfunc='sum', margins=True,
                 normalize='columns', dropna=False)
# the index is 'month rounded'
ct = ct.reset_index()
ct['month rounded'] = ct['month rounded'].astype('float32')
ct = ct.sort_values('month rounded')
ct = ct.set_index('month rounded')
ct2 = ct.cumsum(axis=0)
Use:
new_df = df.assign(
    cumulative=df['Balance'].mask(df['Month'].isna())
                            .groupby(df['Category'])
                            .cumsum()
                            .div(df.groupby('Category')['Balance'].transform('sum'))
).dropna()
print(new_df)
Month Category Balance cumulative
0 1.0 a 100 0.166667
2 2.0 a 200 0.500000
If you want to create a DataFrame for each Category, you could create a dict:
df_category = {i:group for i,group in new_df.groupby('Category')}
df['Category a - cumulative % amount'] = (
    df.groupby(by=df.Month.fillna(np.inf))
      .apply(lambda x: x.Balance.cumsum().div(df.Balance.sum()))
      .reset_index(level=0, drop=True)
)
df.dropna()
Month Category Balance Category a - cumulative % amount
0 1 a 100 0.166667
2 2 a 200 0.333333
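As an aside (not part of either answer above): since pandas 1.1, groupby accepts dropna=False, which keeps the NaN group without the float -> str -> float round-trip, and the numeric dtype is preserved so the result sorts numerically. A minimal sketch:
import pandas as pd

df = pd.DataFrame({'Month': [1, None, 2],
                   'Category': ['a', 'a', 'a'],
                   'Balance': [100, 300, 200]})

# dropna=False (pandas >= 1.1) keeps NaN as its own group
print(df.groupby('Month', dropna=False)['Balance'].sum())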

How does one break out strings of multiple key value pairs in a single dataframe column into a new dataframe in python?

I am pulling data from a sql database into a pandas dataframe. The dataframe is a single column containing various quantities of key value pairs stored in a string. I would like to make a new dataframe that contains two columns, one holding the keys, and the other holding the values.
The dataframe looks like:
In[1]:
print(df.tail())
Out[1]:
WK_VAL_PAIRS
166 {('sloth', 0.073), ('animal', 0.034), ('gift', 0.7843)}
167 {('dabbing', 0.0863), ('gift', 0.7843)}
168 {('grandpa', 0.0156), ('funny', 1.3714), ('grandfather', 0.0015)}
169 {('nerd', 0.0216)}
170 {('funny', 1.3714), ('pineapple', 0.0107)}
Ideally, the new dataframe would look like:
0 | sloth | 0.073
1 | animal | 0.034
2 | gift | 0.07843
3 | dabbing | 0.0863
4 | gift | 0.7843
...
etc.
I have been successful in separating out the key value pairs from a single row into a dataframe, as shown below. From here it will be trivial to split out the pairs into their own columns.
In[2]:
def prep_text(row):
    string = row.replace('{', '')
    string = string.replace('}', '')
    string = string.replace('\',', '\':')
    string = string.replace(' ', '')
    string = string.replace(')', '')
    string = string.replace('(', '')
    string = string.replace('\'', '')
    return string
df['pairs'] = df['WK_VAL_PAIRS'].apply(prep_text)
dd = df['pairs'].iloc[166]
af = pd.DataFrame([dd.split(',') for x in dd.split('\n')])
af.transpose()
Out[2]:
0 sloth:0.073
1 animal:0.034
2 gift:0.7843
3 spirit:0.0065
4 fans:0.0093
5 funny:1.3714
However, I'm missing the leap to apply this transformation to the entire dataframe. Is there a way to do this with an .apply()-style function, rather than a for-each loop? What is the most pythonic way of handling this?
Any help would be appreciated.
Solution
With Chris's strong hint below, I was able to get to an adequate solution for my needs:
def prep_text(row):
    string = row.replace('\'', '')
    string = '"' + string + '"'
    return string

kvp_df = pd.DataFrame(
    re.findall(
        r'(\w+), (\d.\d+)',
        df['WK_VAL_PAIRS'].apply(prep_text).sum()
    )
)
Try re.findall with pandas.DataFrame:
import pandas as pd
import re
s = pd.Series(["{(stepper, 0.0001), (bob, 0.0017), (habitual, 0.0), (line, 0.0097)}",
               "{(pete, 0.01), (joe, 0.0019), (sleep, 0.0), (cline, 0.0099)}"])
pd.DataFrame(re.findall(r'(\w+), (\d.\d+)', s.sum()))
Output:
0 1
0 stepper 0.0001
1 bob 0.0017
2 habitual 0.0
3 line 0.0097
4 pete 0.01
5 joe 0.0019
6 sleep 0.0
7 cline 0.0099
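Another option (a sketch, assuming each cell really is a Python set-of-tuples literal, as the printed data suggests): ast.literal_eval parses each cell safely, and Series.explode flattens the pairs to one per row:
import ast
import pandas as pd

df = pd.DataFrame({'WK_VAL_PAIRS': ["{('sloth', 0.073), ('animal', 0.034), ('gift', 0.7843)}",
                                    "{('nerd', 0.0216)}"]})

# parse each cell into a set of tuples, then flatten to one pair per row
pairs = df['WK_VAL_PAIRS'].apply(ast.literal_eval).explode()
kvp_df = pd.DataFrame(pairs.tolist(), columns=['key', 'value'])
print(kvp_df)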

group_by with Impala Ibis

I have an Impala table that I'd like to query using Ibis. The table looks like the following:
id | timestamp
-------------------
A | 5
A | 7
A | 3
B | 9
B | 5
I'd like to group_by this table according to unique combinations of id and timestamp range. The grouping operation should ultimately produce a single grouped object that I can then apply aggregations on. For example:
group1 conditions: id == A; 4 < timestamp < 11
group2 conditions: id == A; 1 < timestamp < 6
group3 conditions: id == B; 4 < timestamp < 7
yielding a grouped object with the following groups:
group1:
id | timestamp
-------------------
A | 5
A | 7
group2:
id | timestamp
-------------------
A | 5
A | 3
group3:
id | timestamp
-------------------
B | 5
Once I have the groups I'll perform various aggregations to get my final results. If anybody could help me figure this group_by out it would be greatly appreciated, even a regular pandas expression would be helpful!
So here is an example for groupby (no underscore):
df = pd.DataFrame({"id": ["a","b","a","b","c","c"], "timestamp": [1,2,3,4,5,6]})
Create a grouper column for your timestamp:
df["my interval"] = (df["timestamp"] > 3) & (df["timestamp"] < 5)
You need some _data_ columns, i.e. those which you do not use for grouping:
df["dummy"] = 1
df.groupby(["id", "my interval"]).agg("count")["dummy"]
Or you can use both:
df["something that I need"] = df["my interval"] & (df["id"] == "b")
df.groupby(["something that I need"]).agg("count")["dummy"]
You might also want to apply integer division to generate time intervals:
df = pd.DataFrame({"id": ["a","b","a","b","c","c"], "timestamp": [1,2,13,14,25,26], "sales": [0,4,2,3,6,7]})
epoch = 10
df["my interval"] = epoch * (df["timestamp"] // epoch)
df.groupby(["my interval"]).agg("sum")["sales"]
EDIT:
your example:
import pandas as pd
A = "A"
B = "B"
df = pd.DataFrame({"id":[A,A,A,B,B], "timestamp":[5,7,3,9,5]})
df["dummy"] = 1
Solution:
grouper = (df["id"] == A) & (4 < df["timestamp"] ) & ( df["timestamp"] < 11)
df.groupby( grouper ).agg(sum)["dummy"]
or better:
df[grouper]["dummy"].sum()
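Since the three conditions overlap, they cannot come from a single groupby key; a plain-pandas sketch (not from the original answer) that labels each condition and concatenates the matching slices, so one aggregation call covers all groups:
import pandas as pd

df = pd.DataFrame({"id": ["A","A","A","B","B"], "timestamp": [5,7,3,9,5]})

# each group is an arbitrary boolean mask, so build them explicitly;
# rows matching several conditions land in several groups, as intended
conditions = {
    "group1": (df["id"] == "A") & (df["timestamp"] > 4) & (df["timestamp"] < 11),
    "group2": (df["id"] == "A") & (df["timestamp"] > 1) & (df["timestamp"] < 6),
    "group3": (df["id"] == "B") & (df["timestamp"] > 4) & (df["timestamp"] < 7),
}
labeled = pd.concat({name: df[mask] for name, mask in conditions.items()})
print(labeled.groupby(level=0)["timestamp"].agg(["count", "mean"]))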
