I have a set of data with a timestamp, a value, and a quality flag. The value and quality flag are missing for some of the timestamps and need to be filled in a way that depends on the surrounding data. That is:
If the quality flags on the valid data bracketing the NaN data are different, then set the value and quality flag to the same as the bracketing row with the highest quality flag. In the example below, the first set of NaNs would be replaced with qf=3 and value=3.
If the quality flags are the same, then interpolate the value between the two valid values on either side. In the example, the second set of NaNs would be replaced by qf = 1 and v = 6 and 9.
Code:
from datetime import datetime
import pandas as pd

start = datetime.strptime("2004-01-01 00:00", "%Y-%m-%d %H:%M")
end = datetime.strptime("2004-01-01 03:00", "%Y-%m-%d %H:%M")
df = pd.DataFrame(
    data={'v':  [1, 2, 'NaN', 'NaN', 'NaN', 3, 2, 1, 5, 3, 'NaN', 'NaN', 12, 43, 23, 12, 32, 12, 12],
          'qf': [1, 1, 'NaN', 'NaN', 'NaN', 3, 1, 5, 1, 1, 'NaN', 'NaN', 1, 3, 4, 2, 1, 1, 1]},
    index=pd.date_range(start, end, freq="10min"))
I have tried to solve this by finding the NA rows and looping through them to apply the first criterion, then using interpolate for the second. However, this is really slow as I am working with a large dataset.
One approach would just be to do all the possible fills and then choose among them as appropriate. After doing df = df.astype(float) if necessary (your example uses the string "NaN"), something like this should work:
is_null = df.qf.isnull()
fill_down = df.ffill()   # propagate the previous valid row forward
fill_up = df.bfill()     # propagate the next valid row backward

# rule 1: where the bracketing quality flags differ, take the higher-quality row
df.loc[is_null & (fill_down.qf > fill_up.qf)] = fill_down
df.loc[is_null & (fill_down.qf < fill_up.qf)] = fill_up

# rule 2: whatever is still NaN has equal flags on both sides, so interpolate
df = df.interpolate()
It does more work than is necessary, but it's easy to see what it's doing, and the work that it does do is vectorized and so happens pretty quickly. On a version of your dataset expanded to be ~10M rows (with the same density of nulls), it takes ~6s on my old notebook. Depending on your requirements that might suffice.
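For the example frame above, the cast mentioned in the answer is enough, because float('NaN') parses to a real NaN; a different placeholder string would need a replace first (a minimal sketch):

import numpy as np

# the literal string 'NaN' casts cleanly to a float NaN
df = df.astype(float)

# if some other placeholder had been used, replace it first, e.g.:
# df = df.replace('missing', np.nan).astype(float)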
I have a spreadsheet with fields containing a body of text.
I want to calculate the Gunning-Fog score for each row and have the value written to that same Excel file as a new column. To do that, I first need to calculate the score for each row. The code below works if I hard-code the text into the df variable. However, it does not work when I reference the field in the sheet (i.e., rfds) and pass that through to my r variable. I get the following error, even though the two fields I am testing contain 3,896 and 4,843 words respectively.
readability.exceptions.ReadabilityException: 100 words required.
Am I missing something obvious? Disclaimer, I am very new to python and coding in general! Any help is appreciated.
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
rfd = df["Item 1A"]
rfds = rfd.to_string() # to fix "TypeError: expected string or buffer"
r = Readability(rfds)
fog = r.gunning_fog()
print(fog.score)
TL;DR: You need to pass the cell value and are currently passing a column of cells.
The line rfd = df["Item 1A"] returns the whole column (a pandas Series), not the text of a single cell. That is also why the original TypeError was thrown: a Series is not a string. rfd.to_string() then produces the printable rendering of the entire column (row labels plus, typically truncated, cell text) rather than the full body text of any one row, which is why Readability sees fewer than 100 words.
Rather than taking a column and going down it, approach it from the other direction. Take the rows and then pull out the column:
for index, row in df.iterrows():
    print(row.iloc[2])
The [2] is the column index.
Now that we have the value of an individual cell, it can be passed to the Readability calculator:
r = Readability(row.iloc[2])
fog = r.gunning_fog()
print(fog.score)
Note that these can be combined together into one command:
print(Readability(row.iloc[2]).gunning_fog())
This shows how the calls can be chained together; whichever way you find easier to read is up to you. Chaining is useful when you hand the expression to something like apply or applymap.
Putting the whole thing together (the step by step way):
from readability import Readability
import pandas as pd

df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")

for index, row in df.iterrows():
    r = Readability(row.iloc[2])
    fog = r.gunning_fog()
    print(fog.score)
Or the clever way:
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
print(df["Item 1A"].apply(lambda x: Readability(x).gunning_fog()))
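Since the original goal was to write the scores back to the spreadsheet as a new column, a short follow-on sketch could look like the following; the column name fog_score and the output path are made up here:

from readability import Readability
import pandas as pd

df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")

# compute a score per row and store it alongside the text
df["fog_score"] = df["Item 1A"].apply(lambda x: Readability(x).gunning_fog().score)

# write everything, including the new column, to a fresh file (hypothetical path)
df.to_excel(r"C:/Users/name/edgar/test/item1a_scored.xlsx", index=False)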
I have a data frame with 1 column.
- There are many NA values at the beginning and at the end that I would like to eliminate completely.
- At the same time, there are some NA values between two available values that I would like to fill with the mean of the two closest available values.
For illustration, I have attached an image here.
I cannot think of a solution. I just wonder if anyone can please help me with this.
Thank you for your help!
Try this; I have reproduced an example using random numbers:
import pandas as pd
import numpy as np

# build an example: 100 random values with some "#N/A" holes punched into column A
random_index = np.random.randint(0, 100, size=(5, 1))
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 1)), columns=list('A'))
df.loc[10:15, 'A'] = "#N/A"
for c in random_index:
    df.loc[c, "A"] = "#N/A"

# replacement starts from here
df[df == "#N/A"] = np.nan
index = np.where(df['A'].isna())[0]
drops = []
for i in index:
    # fill only when both neighbours exist and hold valid values; otherwise mark the row for dropping
    if 0 < i < len(df) - 1 and not pd.isnull(df.loc[i - 1, "A"]) and not pd.isnull(df.loc[i + 1, "A"]):
        df.loc[i, "A"] = (df.loc[i - 1, "A"] + df.loc[i + 1, "A"]) / 2
    else:
        drops.append(i)
df = df.drop(df.index[drops]).reset_index(drop=True)
First, if each N/A is stored as a string, replace it with np.nan. The most straightforward way is then to use isna on the column and extract the indices where it is true (for example by applying the mask to an np.arange array). From there you can either iterate over those indices with a for loop and check whether they are sequential, or compute the distance between consecutive indices and look for gaps not equal to 1; a sketch of the first variant follows below.
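A minimal sketch of that approach, assuming a one-column DataFrame df whose numeric column "A" already has np.nan in place of the "#N/A" strings:

import numpy as np

nan_idx = np.where(df["A"].isna())[0]          # positions of the missing values

# an isolated NaN has a valid value on both sides; fill it with the mean of its neighbours
for i in nan_idx:
    if 0 < i < len(df) - 1 and (i - 1) not in nan_idx and (i + 1) not in nan_idx:
        df.loc[df.index[i], "A"] = (df["A"].iloc[i - 1] + df["A"].iloc[i + 1]) / 2

# anything still NaN sits at the start, the end, or inside a longer run of NaNs: drop it
df = df.dropna(subset=["A"]).reset_index(drop=True)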
I need to create a data structure that allows indexing via a tuple of floats. Each dimension of the tuple represents one parameter. Each parameter spans a continuous range, and to be able to do my work I binned each range into categories.
Then I want to create a DataFrame with a MultiIndex, each dimension of the index referring to a parameter with the defined categories:
import pandas as pd
import numpy as np

index = pd.interval_range(start=0, end=10, periods=5, closed='both')
index2 = pd.interval_range(start=20, end=30, periods=3, closed='both')
index3 = pd.MultiIndex.from_product([index, index2])
dataStructure = pd.DataFrame(np.zeros((5 * 3, 1)), index=index3)
print(dataStructure)
I checked that the interval_range provides me with the necessary methods e.g.
index.get_loc(2.5)
gives me the right answer. However, I can't get this to work with the DataFrame or the MultiIndex:
index3.get_loc((2.5,21))
does not work. Any idea? I managed to get this working yesterday somehow, so I am 99% convinced there is a simple way to make it work. But my Jupyter notebook was in the cloud, the server crashed, and the notebook has been lost. I became dumber overnight, apparently.
I think selecting by tuple is not implemented yet. A possible solution is to get the positions for each level separately with Index.get_level_values and get_loc, take their intersection with np.intersect1d, and finally select with iloc:
idx1 = dataStructure.index.get_level_values(0).get_loc(2.5)
idx2 = dataStructure.index.get_level_values(1).get_loc(21)

df1 = dataStructure.iloc[np.intersect1d(idx1, idx2)]
print(df1)
                                     0
[2, 4] [20.0, 23.333333333333332]  0.0
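If get_loc on the individual levels misbehaves (its return type varies between pandas versions), an alternative sketch builds a boolean mask per level with the element-wise IntervalIndex.contains check; this assumes a reasonably recent pandas:

# one boolean mask per level: True where that level's interval contains the scalar
lvl0 = dataStructure.index.get_level_values(0)
lvl1 = dataStructure.index.get_level_values(1)
mask = lvl0.contains(2.5) & lvl1.contains(21)

print(dataStructure[mask])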
I have a 10 GB csv file with 170,000,000 rows and 23 columns that I read into a dataframe as follows:
import pandas as pd
d = pd.read_csv(f, dtype = {'tax_id': str})
I also have a list of strings with nearly 20,000 unique elements:
h = ['1123787', '3345634442', '2342345234', .... ]
I want to create a new column called class in the dataframe d. I want to assign d['class'] = 'A' whenever d['tax_id'] has a value that is found in the list of strings h. Otherwise, I want d['class'] = 'B'.
The following code works very quickly on a 1% sample of my dataframe d:
d['class'] = 'B'
d.loc[d['tax_id'].isin(h), 'class'] = 'A'
However, on the complete dataframe d, this code takes over 48 hours (and counting) to run on a 32 core server in batch mode. I suspect that indexing with loc is slowing down the code, but I'm not sure what it could really be.
In sum: Is there a more efficient way of creating the class column?
If your tax IDs are unique, I would recommend setting tax_id as the index and then indexing on that. As it stands, you call isin, which is a linear operation. However fast your machine is, it can't do a linear search on 170 million records in a reasonable amount of time.
d = d.set_index('tax_id')  # or: d.set_index('tax_id', inplace=True)
d['class'] = 'B'
d.loc[h, 'class'] = 'A'
If you're still suffering from performance issues, I'd recommend switching to distributed processing with dask.
"I also have a list of strings with nearly 20,000 unique elements"
Well, for starters, you should make that list a set if you are going to be using it for membership testing. list objects have linear time membership testing, set objects have very optimized constant-time performance for membership testing. That is the lowest hanging fruit here. So use
h = set(h) # convert list to set
d['class'] = 'B'
d.loc[d['tax_id'].isin(h), 'class'] = 'A'
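A further variant (not from either answer above): the same column can be built in one vectorized pass with numpy.where, which avoids the separate default assignment followed by a .loc update:

import numpy as np

# True/False per row from the membership test, mapped straight to 'A'/'B'
d['class'] = np.where(d['tax_id'].isin(h), 'A', 'B')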
I would like to create a pandas SparseDataFrame with dimensions 250,000 x 250,000. In the end, my aim is to come up with a big adjacency matrix.
So far, creating that data frame is no problem:
import pandas as pd
import numpy as np

df = pd.SparseDataFrame(columns=np.arange(250000), index=np.arange(250000))
But when I try to update the DataFrame, I run into massive memory/runtime problems:
index = 1000
col = 2000
value = 1
df.set_value(index, col, value)
I checked the source:
def set_value(self, index, col, value):
"""
Put single value at passed column and index
Parameters
----------
index : row label
col : column label
value : scalar value
Notes
-----
This method *always* returns a new object. It is currently not
particularly efficient (and potentially very expensive) but is provided
for API compatibility with DataFrame
...
That last sentence describes exactly the problem I am hitting with pandas. I would really like to keep using pandas here, but as it stands it is totally impossible!
Does someone have an idea how to solve this problem more efficiently?
My next idea is to work with something like nested lists/dicts or so...
Thanks for your help!
Do it this way
df = pd.SparseDataFrame(columns=np.arange(250000), index=np.arange(250000))
s = df[2000].to_dense()
s[1000] = 1
df[2000] = s
In [11]: df.ix[1000,2000]
Out[11]: 1.0
So the procedure is to swap out an entire series at a time. The SparseDataFrame will convert the passed-in series to a SparseSeries. (You can do it yourself to see what they look like with s.to_sparse().) The SparseDataFrame is basically a dict of these SparseSeries, which themselves are immutable. Sparse handling will see some changes in 0.12 to better support these types of operations (e.g. setting will work efficiently).
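A tiny sketch wrapping that swap-out pattern in a helper, so only one column is ever densified at a time. The function name set_sparse_value is made up here (it is not a pandas API), and this targets the old pandas versions that still ship SparseDataFrame:

import numpy as np
import pandas as pd

def set_sparse_value(sdf, row, col, value):
    """Set a single cell of a SparseDataFrame by densifying only the affected column."""
    s = sdf[col].to_dense()   # densify just this one column
    s[row] = value
    sdf[col] = s              # assigning back converts it to a SparseSeries again
    return sdf

# usage, mirroring the example above
df = pd.SparseDataFrame(columns=np.arange(250000), index=np.arange(250000))
set_sparse_value(df, 1000, 2000, 1)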