parsing through and editing csv with pandas - python

I'm trying to parse through all of the cells in a CSV file that represent heights and round what's after the decimal to match a number in a list (to round down to the nearest inch). After a few days of banging my head against the wall, this is the code I've been able to get working:
import math
import pandas as pd

inch = [.0, .08, .16, .25, .33, .41, .50, .58, .66, .75, .83, .91, 1]
df = pd.read_csv("sample_csv.csv")

def to_number(s):
    for index, row in df.iterrows():
        try:
            num = float(s)
            num = math.modf(num)
            num = list(num)
            for i, j in enumerate(inch):
                if num[0] < j:
                    num[0] = inch[i-1]
                    break
                elif num[0] == j:
                    num[0] = inch[i]
                    break
            newnum = num[0] + num[1]
            return newnum
        except ValueError:
            return s

df = df.apply(lambda f: to_number(f[0]), axis=1).fillna('')

with open('new.csv', 'a') as f:
    df.to_csv(f, index=False)
Ideally I'd like to have it parse over an entire CSV with n headers, ignoring all strings and rounding the floats to match the list. Is there a simple(r) way to achieve this with Pandas? And would it be possible (or a good idea?) to have it edit the existing Excel workbook instead of creating a new CSV I'd have to copy/paste over?
Any help or suggestions would be greatly appreciated as I'm very new to Pandas and it's pretty god damn intimidating!

Helping would be a lot easier if you included a sample mock of the data you're trying to parse. To clarify the points you don't specify, as I understand it:
By "an entire CSV with n headers, ignoring all strings and round the floats to match the list" you mean some n-column dataframe with k numeric columns each of which describe someone's height in inches.
The entries in the numeric columns are measured in units of feet.
You want to ignore the non-numeric columns and transform the data as 6.14 -> 6 feet, 1 inch (I'm implicitly assuming that by "round down" you want an integer floor; i.e. 6.14 feet is 6 feet, 0.14*12 = 1.68 inches; it's up to you whether this is floored or rounded to the nearest integer).
Now for a sample of random heights measured in feet, sampled uniformly between 5.1 and 6.9 feet, we could do the following:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame(np.random.uniform(5.1, 6.9, size=(10,3)))
In [4]: df
Out[4]:
0 1 2
0 6.020613 6.315707 5.413499
1 5.942232 6.834540 6.761765
2 5.715405 6.162719 6.363224
3 6.416955 6.511843 5.512515
4 6.472462 5.789654 5.270047
5 6.370964 5.509568 6.113121
6 6.353790 6.466489 5.460961
7 6.526039 5.999284 6.617608
8 6.897215 6.016648 5.681619
9 6.886359 5.988068 5.575993
In [5]: np.fix(df) + np.floor(12*(df - np.fix(df)))/12
Out[5]:
0 1 2
0 6.000000 6.250000 5.333333
1 5.916667 6.833333 6.750000
2 5.666667 6.083333 6.333333
3 6.416667 6.500000 5.500000
4 6.416667 5.750000 5.250000
5 6.333333 5.500000 6.083333
6 6.333333 6.416667 5.416667
7 6.500000 5.916667 6.583333
8 6.833333 6.000000 5.666667
9 6.833333 5.916667 5.500000
We're using np.fix to extract the integral part of the height value. Likewise, df - np.fix(df) represents the fractional remainder in feet, or in inches when multiplied by 12. np.floor truncates this down to the nearest whole inch, and the final division by 12 converts the unit of measurement back from inches to feet.
You can change np.floor to np.round to get an answer rounded to the nearest inch rather than truncated to the previous whole inch. Finally, you can specify the precision of the output to insist that the decimal portion is selected from your list.
In [6]: (np.fix(df) + np.round(12*(df - np.fix(df)))/12).round(2)
Out[6]:
0 1 2
0 6.58 5.25 6.33
1 5.17 6.42 5.67
2 6.42 5.83 6.33
3 5.92 5.67 6.33
4 6.83 5.25 6.58
5 5.83 5.50 6.92
6 6.83 6.58 6.25
7 5.83 5.33 6.50
8 5.25 6.00 6.83
9 6.42 5.33 5.08

Adding onto the other answer to address your problem with strings:
# Break the dataframe with a string
df = pd.DataFrame(np.random.uniform(5.1, 6.9, size=(10,3)))
df.iloc[0, 0] = 'str'  # .ix is removed in recent pandas; .iloc does the same here
# Find out which things can be cast to numerics and put NaNs everywhere else
df_safe = df.apply(pd.to_numeric, axis=0, errors="coerce")
df_safe = (np.fix(df_safe) + np.round(12*(df_safe - np.fix(df_safe)))/12).round(2)
# Replace all the NaNs with the original data
df_safe[df_safe.isnull()] = df[df_safe.isnull()]
df_safe should be what you want. Despite the name, this isn't particularly safe and there are probably edge conditions that will be a problem.
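As for the last part of the original question (editing the existing Excel workbook instead of producing a new CSV), neither answer covers that; here is a minimal sketch, assuming the data actually lives in an .xlsx file, openpyxl is installed, and pandas is recent enough to support if_sheet_exists. The file and sheet names below are placeholders.
import pandas as pd
# mode="a" appends to an existing workbook; if_sheet_exists="replace"
# rewrites just that one sheet instead of the whole file.
with pd.ExcelWriter("heights.xlsx", engine="openpyxl", mode="a",
                    if_sheet_exists="replace") as writer:
    df_safe.to_excel(writer, sheet_name="heights", index=False)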

Related

Remove unwanted characters from string using pandas [duplicate]

I have the following dataframe:
df = pd.DataFrame({'A': ['2.5cm','2.5cm','2.56”','1.38”','2.2”','0.8 in','$18.00','4','2"']})
which looks like:
A
2.5cm
2.5cm
2.56”
1.38”
2.2”
0.8 in
$18.00
4
2"
I want to remove all characters except the digits and decimal points.
The output should be:
A
2.5
2.5
2.56
1.38
2.2
0.8
18.00
4
2
Here is what I've tried:
df['A'] = df.A.str.replace(r"[a-zA-Z]", '')
df['A'] = df.A.str.replace('\W', '')
but this is stripping out everything including the decimal point.
Any suggestions would be greatly appreciated.
Thank you in advance
You can use str.extract to extract only the floating-point values:
df['A'] = df['A'].astype(str).str.extract(r'(\d+.\d+|\d)').astype('float')
However, the unescaped '.' here matches any character, not just the period, so it would match something like 18,00 instead of 18; the pattern also fails to extract multi-digit whole numbers. Use the code below instead (thanks @DYZ):
df['A'] = df['A'].astype(str).str.extract(r'(\d+\.\d+|\d+)').astype('float')
Output:
A
0 2.50
1 2.50
2 2.56
3 1.38
4 2.20
5 0.80
6 18.00
7 4.00
8 2.00
Try with str.extract:
df['new'] = df.A.str.extract(r'(\d*\.\d+|\d+)').astype(float).iloc[:,0]
Out[31]:
0
0 2.50
1 2.50
2 2.56
3 1.38
4 2.20
5 0.80
6 18.00
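Note that passing expand=False to str.extract with a single capture group returns a Series directly, so the trailing .iloc[:,0] isn't needed:
df['new'] = df.A.str.extract(r'(\d*\.\d+|\d+)', expand=False).astype(float)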

Python/Pandas For Loop Time Series

I am working with panel time-series data and am struggling to write a fast for loop that sums up the past 50 numbers at the current index i. The data is about 600k rows, and the loop starts to churn at around 30k. Is there a way to use pandas or NumPy to do the same in a fraction of the time?
The change column is of type float, with 4 decimals.
Index Change
0 0.0410
1 0.0000
2 0.1201
... ...
74327 0.0000
74328 0.0231
74329 0.0109
74330 0.0462
SEQ_LEN = 50

for i in range(SEQ_LEN, len(df)):
    df.at[i, 'Change_Sum'] = sum(df['Change'][i-SEQ_LEN:i])
Any help would be highly appreciated! Thank you!
I tried this with 600k rows and the average time was
20.9 ms ± 1.35 ms
This will return a Series with the rolling sum of the last 50 Change values in the df:
df['Change'].rolling(50).sum()
you can add it to a new column like so:
df['change50'] = df['Change'].rolling(50).sum()
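One detail worth noting: the loop in the question sums the 50 rows before index i, excluding the current row, whereas rolling(50).sum() includes the current row. If that distinction matters, shifting the result by one row reproduces the loop exactly:
df['Change_Sum'] = df['Change'].rolling(50).sum().shift(1)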
Disclaimer: This solution cannot compete with .rolling(). Also, if this is a .groupby() case, just do df.groupby("group")["Change"].rolling(50).sum() and then reset the index. Therefore please accept the other answer.
The explicit for loop can be avoided by translating your running partial sum into a difference of cumulative sums (cumsum). The formula:
Sum[x-50:x] = Sum[:x] - Sum[:x-50] = Cumsum[x] - Cumsum[x-50]
Code
For demonstration purposes, I have shortened len(df["Change"]) to 10 and SEQ_LEN to 5. A million records completed almost immediately this way.
import pandas as pd
import numpy as np

# data
SEQ_LEN = 5
np.random.seed(111)  # reproducibility
df = pd.DataFrame(
    data={
        "Change": np.random.normal(0, 1, 10)  # 10 rows for the demo; scale up to a million in practice
    }
)

# step 1. Do cumsum
df["Change_Cumsum"] = df["Change"].cumsum()

# Step 2. calculate diff of cumsum: Sum[x-50:x] = Sum[:x] - Sum[:x-50]
df["Change_Sum"] = np.nan  # or zero as you wish
df.loc[SEQ_LEN:, "Change_Sum"] = df["Change_Cumsum"].values[SEQ_LEN:] - df["Change_Cumsum"].values[:(-SEQ_LEN)]

# add idx=SEQ_LEN-1
df.at[SEQ_LEN-1, "Change_Sum"] = df.at[SEQ_LEN-1, "Change_Cumsum"]
Output
df
Out[30]:
Change Change_Cumsum Change_Sum
0 -1.133838 -1.133838 NaN
1 0.384319 -0.749519 NaN
2 1.496554 0.747035 NaN
3 -0.355382 0.391652 NaN
4 -0.787534 -0.395881 -0.395881
5 -0.459439 -0.855320 0.278518
6 -0.059169 -0.914489 -0.164970
7 -0.354174 -1.268662 -2.015697
8 -0.735523 -2.004185 -2.395838
9 -1.183940 -3.188125 -2.792244

Round a float to the smallest number keeping a few decimal places

I need to round a column with floats to 2 decimal places, but without rounding the data to the nearest value
My data:
df = pd.DataFrame({'numbers': [1.233,1.238,5.059,5.068, 8.556]})
df.head()
numbers
0 1.233
1 1.238
2 5.059
3 5.068
4 8.556
Expected output:
numbers
0 1.23
1 1.23
2 5.05
3 5.06
4 8.55
The problem
Everything I've tried rounds the numbers to the nearest value (a trailing digit of 0-4 rounds down, while 5-9 adds 1 to the truncated decimal place)
Examples of what didn't work
df[['numbers']].round(2)
#or df['numbers'].apply(lambda x: "%.2f" % x)
#output
numbers
0 1.23
1 1.24
2 5.06
3 5.07
4 8.56
This is more like rounding down (truncation):
df.numbers*100//1/100
Out[186]:
0 1.23
1 1.23
2 5.05
3 5.06
4 8.55
Name: numbers, dtype: float64
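To store the result back in the frame:
df['numbers'] = df.numbers*100//1/100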
Try this, it works well too:
import pandas as pd
do = lambda x: float(str(x).split('.')[0] +'.' + str(x).split('.')[1][0:2])
df = pd.DataFrame({'numbers': list(map(do, [1.233,1.238,5.059,5.068, 8.556]))})
print(df.head())
output
numbers
0 1.23
1 1.23
2 5.05
3 5.06
4 8.55

Python - Using Pandas to eliminate curly brackets and output floats

I have this large CSV data set that essentially has x and y values in each column.
"{733.15, 179.5}",
"{565.5, 642.5}",
"{172.5, 375.5}",
"{223.5, 554.5}",....
....,
"{213.5, 666.5}",
"{851.5, 323.5}",
"{498.5, 638.5}",
"{763.5, 102.5}"
Or, viewed as a table: a column is essentially this set, and I can access each pair by indexing.
import numpy as np
import pandas as pd
import csv
brown = pd.read_csv('BrownM.csv',delimiter=',', header=None)
print brown[0]
this essentially returns the column shown above
print brown[0][0]
returns {733.15, 179.5}
but when wanting to select a value in this set,
print brown[0][0][1]
returns 7
It's treating this data set as a string, when I want it to return floats when called upon.
Also, is there a way to parse the file so that the curly brackets are eliminated?
Or you can extract then split.
df.col1.str.extract(r'{(.*)}', expand=False).str.split(', ', expand=True)
Timing
MaxU's solution is quicker, as it does the work in one step, whereas mine takes two.
UPDATE:
def str2coords(df, col, new_cols):
    df[new_cols] = df[col].str.extract(r'\{([\d\.]+),\s*([\d\.]+)\}', expand=True).astype(np.float64)
    return df.drop(col, axis=1)
In [204]: df
Out[204]:
coord1 coord2
0 {733.15, 179.5} {33.15, 79.5}
1 {565.5, 642.5} {65.5, 42.5}
2 {172.5, 375.5} {72.5, 75.5}
3 {223.5, 554.5} {23.5, 54.5}
4 {213.5, 666.5} {13.5, 66.5}
5 {851.5, 323.5} {51.5, 23.5}
6 {498.5, 638.5} {98.5, 38.5}
7 {763.5, 102.5} {63.5, 02.5}
In [205]: df = str2coords(df, 'coord1', ['x1','y1'])
In [206]: df = str2coords(df, 'coord2', ['x2','y2'])
In [207]: df
Out[207]:
x1 y1 x2 y2
0 733.15 179.5 33.15 79.5
1 565.50 642.5 65.50 42.5
2 172.50 375.5 72.50 75.5
3 223.50 554.5 23.50 54.5
4 213.50 666.5 13.50 66.5
5 851.50 323.5 51.50 23.5
6 498.50 638.5 98.50 38.5
7 763.50 102.5 63.50 2.5
In [208]: df.dtypes
Out[208]:
x1 float64
y1 float64
x2 float64
y2 float64
dtype: object
You can parse your coordinates into separate columns using the .str.extract() function:
In [155]: df[['x','y']] = df.coord.str.extract(r'\{([\d\.]+),\s*([\d\.]+)\}', expand=True)
In [156]: df
Out[156]:
coord x y
0 {733.15, 179.5} 733.15 179.5
1 {565.5, 642.5} 565.5 642.5
2 {172.5, 375.5} 172.5 375.5
3 {223.5, 554.5} 223.5 554.5
4 {213.5, 666.5} 213.5 666.5
5 {851.5, 323.5} 851.5 323.5
6 {498.5, 638.5} 498.5 638.5
7 {763.5, 102.5} 763.5 102.5
You can use a regex against the string, then parse it to a float.
import re
# Returns 733.15
float(re.match(r'\{(.*),\s*(.*)\}', '{733.15, 179.5}').group(1))
# Returns 179.5
float(re.match(r'\{(.*),\s*(.*)\}', '{733.15, 179.5}').group(2))
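To get rid of the curly brackets at read time rather than after the fact, here is a minimal sketch using the converters argument of read_csv; the brace-stripping helper below is illustrative and assumes every cell in column 0 looks like "{x, y}":
import pandas as pd

def parse_pair(cell):
    # "{733.15, 179.5}" -> (733.15, 179.5)
    x, y = cell.strip().strip('{}').split(',')
    return float(x), float(y)

brown = pd.read_csv('BrownM.csv', header=None, converters={0: parse_pair})
# brown[0][0] is now the tuple (733.15, 179.5) rather than a string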

Efficient way to set pandas DataFrame values quickly in a for loop? Possible to vectorize?

A quick rundown of my goal:
I am trying to make a DataFrame that contains arrays of cashflow payments. Rows are m number of loans and columns are n number of dates, and values are the associated payments on those dates, if any. My current approach is to generate the m x n DataFrame, then find each cashflow on each date of each loan and set the corresponding section of the DataFrame to that value.
cashflow_frame = pd.DataFrame(columns = all_dates, index = all_ids)
I currently have a for loop that does what I want, but takes much too long to execute. I've line profiled the code:
Timer unit: 1e-06 s
Total time: 38.6231 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2 def frameMaker():
3 3642 1208212 331.7 3.1 for ids, slices in id_grouped:
4 3641 3542 1.0 0.0 data_slice = slices
5 3641 17040 4.7 0.0 original_index = data_slice.index.values
6 3641 583252 160.2 1.5 funded_amt = -data_slice.ix[original_index[0],'outstanding_princp_beg']
7 3641 2091958 574.6 5.4 issue_d = data_slice.ix[original_index[0], 'issue_d']
8 3641 346722 95.2 0.9 pmt_date_ranges = data_slice['date'].values
9 3641 101051 27.8 0.3 date_ranges = np.append(issue_d, pmt_date_ranges)
10 3641 310452 85.3 0.8 rest_cfs = data_slice['pmt_amt_received'].values
11 3641 50856 14.0 0.1 cfs = np.append(funded_amt, rest_cfs)
12
13 3641 321601 88.3 0.8 if data_slice.ix[original_index[-1], 'charged_off_recovs'] > 0:
14 412 6094 14.8 0.0 cfs[-1] = (data_slice.ix[original_index[-1], 'charged_off_recovs'] -
15 412 35943 87.2 0.1 data_slice.ix[original_index[-1], 'charged_off_fees'])
16
17 3641 33546392 9213.5 86.9 cashflow_frame.ix[ids,date_ranges] = cfs
so I can see that setting the arrays in the DataFrame is the slowest. I've also noticed that it gets exponentially slower/takes a larger % Time as I have more loans. Why does it get slower and slower? What is a faster way to set values in a DataFrame? Is there a way to vectorize the operations?
I'm considering generating arrays of equal length for each loan and then creating the cashflow_frame from a (dict[loan ids] = [cashflows]), but would like to use my original code if there is a way to speed it up significantly.
More details:
http://imgur.com/a/ptgbZ
id_grouped is the top DataFrame grouped by 'id'. Data is read in from csv.
My code makes the lower DataFrame, which is exactly how I want it, but it takes much too long.
The first DataFrame is 8.5m rows with about 640,000 loan ids.
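As a rough sketch of the vectorised direction hinted at above: keep the payments in long format and pivot them into the wide m x n frame in one step instead of assigning row slices in a loop. The toy data below is illustrative only, and the sketch ignores the funded-amount and charge-off adjustments from the profiled code.
import pandas as pd

# Toy long-format payment data standing in for the real 8.5m-row frame.
payments = pd.DataFrame({
    'id':   [1, 1, 2, 2, 2],
    'date': ['2016-01', '2016-02', '2016-01', '2016-02', '2016-03'],
    'pmt_amt_received': [100.0, 100.0, 50.0, 50.0, 50.0],
})

# One wide loans-by-dates frame, with no per-row .ix assignments.
cashflow_frame = (
    payments.pivot_table(index='id', columns='date',
                         values='pmt_amt_received', aggfunc='sum')
            .fillna(0)
)
print(cashflow_frame)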
