Combine Consecutive Rows for given index values in Pandas DataFrame - python

I was extracting tables from a PDF with tabula-py. Where a table row spans more than one line in the PDF, tabula-py converts that single row into multiple rows in the DataFrame. Here is a sample:
Serial No. Name Type Total
0 1 Easter Multiple 19
1 2 Costeri Roundabout 16
2 3 Zhiop Tee 16
3 4 Nesss Cross 10
4 5 Uoar Lhahara Tee 10
5 6 Trino Nishra (KX) Tee 9
6 7 Old-FX Box Cross 8
7 8 Gardeners Roundabout 8
8 9 Max Detter Roundabout 7
9 NaN Others (Asynco, NaN NaN
10 10 D+ E, Cross 7
11 NaN etc) NaN NaN
If you look at the sample, you will see that the rows at indices 9, 10, and 11 are actually a single row that spanned multiple lines in the PDF table. The table has more than 100 rows, and this issue occurs in at least 12 places: sometimes as 2 consecutive rows, sometimes as 3. How can we merge those rows using their index values?

You can try (with numpy imported as np):
df['Serial No.'] = df['Serial No.'].bfill().ffill()
df['Total'] = df['Total'].astype(str).replace('nan', np.nan)
df_out = df.groupby('Serial No.', as_index=False).agg(lambda x: ''.join(x.dropna()))
df_out['Total'] = df_out['Total'].replace('', np.nan, regex=True).astype(float)
Result:
print(df_out)
Serial No. Name Type Total
0 1.0 Easter Multiple 19.0
1 2.0 Costeri Roundabout 16.0
2 3.0 Zhiop Tee 16.0
3 4.0 Nesss Cross 10.0
4 5.0 Uoar Lhahara Tee 10.0
5 6.0 Trino Nishra(KX) Tee 9.0
6 7.0 Old-FX Box Cross 8.0
7 8.0 Gardeners Roundabout 8.0
8 9.0 Max Detter Roundabout 7.0
9 10.0 Others (Asynco,D+ E,etc) Cross 7.0
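For reference, here is a minimal self-contained sketch of the same idea on a four-row excerpt of the sample. The helper name join_text and the choice to join fragments with a space and take the first non-null Total are my own assumptions, not part of the answer above. Like the answer, it assumes a continuation line always belongs to the nearest serial number below it (bfill), falling back to the one above (ffill).

```python
import numpy as np
import pandas as pd

# Small excerpt of the sample: serial 10 is split over three rows.
df = pd.DataFrame({
    "Serial No.": [9, np.nan, 10, np.nan],
    "Name": ["Max Detter", "Others (Asynco,", "D+ E,", "etc)"],
    "Type": ["Roundabout", np.nan, "Cross", np.nan],
    "Total": [7, np.nan, 7, np.nan],
})

# Give every fragment the serial number of the row it belongs to.
df["Serial No."] = df["Serial No."].bfill().ffill()

def join_text(s):
    """Join the non-null text fragments of one group (hypothetical helper)."""
    return " ".join(s.dropna().astype(str))

merged = df.groupby("Serial No.", as_index=False).agg(
    {"Name": join_text, "Type": join_text, "Total": "first"}
)
print(merged)
```

Using "first" for Total keeps the column numeric, so no round trip through strings is needed.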

Related

Python Pandas - How to create a dataframe from a sequence

I'm trying to create a dataframe populated by repeating rows based on an existing steady sequence.
For example, if I had a sequence increasing in 3s from 6 to 18, the sequence could be generated using np.arange(6, 18, 3) to give array([ 6, 9, 12, 15]).
How would I go about generating a dataframe in this way?
How could I get the below if I wanted 6 repeated rows?
0 1 2 3
0 6.0 9.0 12.0 15.0
1 6.0 9.0 12.0 15.0
2 6.0 9.0 12.0 15.0
3 6.0 9.0 12.0 15.0
4 6.0 9.0 12.0 15.0
5 6.0 9.0 12.0 15.0
6 6.0 9.0 12.0 15.0
The reason for creating this matrix is that I then wish to add a pd.sequence row-wise to this matrix
pd.DataFrame([np.arange(6, 18, 3)]*7)
Alternatively,
pd.DataFrame(np.repeat([np.arange(6, 18, 3)],7, axis=0))
0 1 2 3
0 6 9 12 15
1 6 9 12 15
2 6 9 12 15
3 6 9 12 15
4 6 9 12 15
5 6 9 12 15
6 6 9 12 15
Here is a solution using NumPy broadcasting which avoids Python loops, lists, and excessive memory allocation (as done by np.repeat):
pd.DataFrame(np.broadcast_to(np.arange(6, 18, 3), (6, 4)))
To understand why this is more efficient than other solutions, refer to the np.broadcast_to() docs: https://numpy.org/doc/stable/reference/generated/numpy.broadcast_to.html
more than one element of a broadcasted array may refer to a single memory location.
This means that no matter how many rows you create before passing the array to Pandas, you only really allocate a single row, plus a 2D view that refers to that row's data multiple times.
If you assign the above to df, you can see that df.values.base is a single row. This is the only storage required, no matter how many rows appear in the DataFrame.
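The memory sharing can be checked on the NumPy side alone. This is a small sketch using only NumPy (the variable names are mine); whether the DataFrame keeps sharing that memory depends on the pandas version and its copy behaviour, so that part is not asserted here.

```python
import numpy as np

row = np.arange(6, 18, 3)             # array([ 6,  9, 12, 15])
view = np.broadcast_to(row, (6, 4))   # 6 "rows", no data copied

print(view.shape)                     # (6, 4)
print(view.strides[0])                # 0: stepping to the next row moves 0 bytes
print(np.shares_memory(view, row))    # True: all rows alias the same buffer
print(view.flags.writeable)           # False: the broadcast view is read-only
```

Because the view is read-only and every row aliases the same data, make a copy (for example with np.array(view)) before modifying individual rows.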

How to use pandas resample using 'day of year' data (Python)

My pandas array looks like this...
DOY Value
0 5 5118
1 10 5098
2 15 5153
I've been trying to resample my data and fill in the gaps using pandas resample function. My worry is that since I'm trying to resample without using direct datetime values, I won't be able to resample my data.
My attempt to solve this was using the following line of code but got an error saying I was using Range Index. Perhaps I need to use Period Index somehow, but I'm not sure how to go about it.
inter.resample('1D').mean().interpolate()
Here's my intended result
DOY Value
0 5 5118
1 6 5114
2 7 5110
3 8 5106
4 9 5102
5 10 5098
: : :
10 15 5153
Convert DOY with pd.to_datetime, perform the resample, and then drop the unwanted column:
df["date"] = pd.to_datetime(df["DOY"].astype(str),format="%j")
output = df.resample("D", on="date").last().drop("date", axis=1).interpolate().reset_index(drop=True)
>>> output
DOY Value
0 5.0 5118.0
1 6.0 5114.0
2 7.0 5110.0
3 8.0 5106.0
4 9.0 5102.0
5 10.0 5098.0
6 11.0 5109.0
7 12.0 5120.0
8 13.0 5131.0
9 14.0 5142.0
10 15.0 5153.0
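pd.DataFrame.interpolate fills NaN values row by row; with the default linear method it treats the rows as equally spaced. So let's start by setting DOY as the index and then reindexing over a new index that has one entry per day, over which we will interpolate.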
d0 = df.set_index('DOY')
idx = pd.RangeIndex(d0.index.min(), d0.index.max()+1, name='DOY')
d0.reindex(idx).interpolate().reset_index()
DOY Value
0 5 5118.0
1 6 5114.0
2 7 5110.0
3 8 5106.0
4 9 5102.0
5 10 5098.0
6 11 5109.0
7 12 5120.0
8 13 5131.0
9 14 5142.0
10 15 5153.0
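If you prefer to stay with resample, here is a hedged variant that puts the parsed dates on the index instead of using on= and drop. It is a sketch of mine rather than either answer's exact code. It relies on "%j" parsing the day of year, with the year defaulting to 1900, and rebuilds DOY from the resampled index.

```python
import pandas as pd

df = pd.DataFrame({"DOY": [5, 10, 15], "Value": [5118, 5098, 5153]})

# Parse day-of-year into dates (the year defaults to 1900 and is discarded later).
dates = pd.to_datetime(df["DOY"].astype(str), format="%j")

# Resample to daily frequency; days without data become NaN, then interpolate.
daily = df.set_index(dates)["Value"].resample("D").mean().interpolate()

# Rebuild the DOY column from the datetime index.
out = pd.DataFrame({"DOY": daily.index.dayofyear, "Value": daily.to_numpy()})
print(out)
```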

Creating a dataframe from monthly values which dont start on january

I have some data in list form, such as:
Q=[2,3,4,5,6,7,8,9,10,11,12] #values
M=[11,0,1,2,3,4,5,6,7,8,9] #months
Y=[2010,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011] #years
I want to get a dataframe with one row per year and one column per month, placing the values of Q at the positions given by M and Y.
So far I have tried a couple of things. My current code is as follows:
def save_data(data_list,year_info,month_info):
    #how many datapoints
    n_data=len(data_list)
    #how many years
    y0=year_info[0]
    yf=year_info[n_data-1]
    n_years=yf-y0+1
    #creating the list i want to fill out
    df_list=[[math.nan]*12]*n_years
    ind=0
    for y in range(n_years):
        for m in range(12):
            if ind<len(data_list):
                if year_info[ind]-y0==y and month_info[ind]==m:
                    df_list[y][m]=data_list[ind]
                    ind+=1
    df=pd.DataFrame(df_list)
    return df
I get this output:
0 1 2 3 4 5 6 7 8 9 10 11
0 3 4 5 6 7 8 9 10 11 12 nan 2
1 3 4 5 6 7 8 9 10 11 12 nan 2
And i want to get:
0 1 2 3 4 5 6 7 8 9 10 11
0 nan nan nan nan nan nan nan nan nan nan nan 2
1 3 4 5 6 7 8 9 10 11 12 nan nan
I have tried a bunch of different things, but so far nothing has worked. I'm wondering if there is a more straightforward way of doing this. My code seems to be overwriting values in a weird way: for instance, I do not know why there is a 2 in the last position of the second row, since that is the first value of my list.
Thanks in advance!
Try pivot:
(pd.DataFrame({'Y':Y,'M':M,'Q':Q})
.pivot(index='Y', columns='M', values='Q')
)
Output:
M 0 1 2 3 4 5 6 7 8 9 11
Y
2010 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0
2011 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 NaN
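As for the "weird overwriting" in the question: [[math.nan]*12]*n_years repeats a reference to one inner list, so every row is the same list object and a write to one row shows up in all of them. The sketch below demonstrates that, and then shows pivot followed by reindex(columns=range(12)) so that months with no data (10 here) still get a column. The reindex step and the variable names are my additions to the answer above.

```python
import math
import pandas as pd

# Aliasing: both "rows" are the same list object.
aliased = [[math.nan] * 12] * 2
aliased[0][11] = 2
print(aliased[1][11])      # 2, the write to row 0 is visible in row 1

# A list comprehension builds an independent list per row.
independent = [[math.nan] * 12 for _ in range(2)]
independent[0][11] = 2
print(independent[1][11])  # nan

Q = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]                               # values
M = [11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]                                 # months
Y = [2010, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011]  # years

wide = (
    pd.DataFrame({"Y": Y, "M": M, "Q": Q})
    .pivot(index="Y", columns="M", values="Q")
    .reindex(columns=range(12))  # make sure all 12 months appear
)
print(wide)
```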

How fill unstinting numeric values in df column

I am trying to add rows to a data frame so that the Weeks column follows the numeric order 1 to 52.
My data is missing some of these numbers, so I need to add those rows and fill their values with NaN or null.
df = pd.DataFrame({"Weeks": [1,2,3,15,16,20,21,52],
                   "Values": [10,10,10,10,50,60,70,40]})
Desired output:
Weeks Values
1 10
2 10
3 10
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
...
52 40
and so on until Weeks reaches 52
My solution:
new_df = pd.DataFrame({"Weeks": "", "Values": ""})
for x in range(1,53):
    for i in df.Weeks:
        if x == i:
            new_df["Weeks"] = x
            new_df["Values"] = df.Values[i]
The problem is that it is super inefficient. Does anyone know a much more efficient way to do it?
You could use set_index to set Weeks as the index and reindex with a range up to and including the maximum week (range excludes its end point, hence the + 1):
df.set_index('Weeks').reindex(range(1, df.Weeks.max() + 1))
Or accounting for the minimum week too:
df.set_index('Weeks').reindex(range(df.Weeks.min(), df.Weeks.max() + 1))
Values
Weeks
1 10.0
2 10.0
3 10.0
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 10.0
16 50.0
17 NaN
...
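Here is a self-contained sketch of the reindex approach that always covers weeks 1 to 52, regardless of which weeks are present in the data. The fixed range 1..52 comes from the question; the names all_weeks and full are mine.

```python
import pandas as pd

df = pd.DataFrame({"Weeks": [1, 2, 3, 15, 16, 20, 21, 52],
                   "Values": [10, 10, 10, 10, 50, 60, 70, 40]})

# One index entry per week; RangeIndex(1, 53) covers 1..52 inclusive.
all_weeks = pd.RangeIndex(1, 53, name="Weeks")

# Missing weeks get NaN in Values; reset_index turns Weeks back into a column.
full = df.set_index("Weeks").reindex(all_weeks).reset_index()
print(full)
```

Note that Values becomes float, because NaN cannot be stored in an integer column.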

opposite of df.diff() in pandas

I have searched the forums in search of a cleaner way to create a new column in a dataframe that is the sum of the row with the previous row- the opposite of the .diff() function which takes the difference.
this is how I'm currently solving the problem
df = pd.DataFrame({'c':['dd','ee','ff', 'gg', 'hh'], 'd':[1,2,3,4,5]})
df['e']= df['d'].shift(-1)
df['f'] = df['d'] + df['e']
Your ideas are appreciated.
You can use rolling with a window size of 2 and sum:
df['f'] = df['d'].rolling(2).sum().shift(-1)
c d f
0 dd 1 3.0
1 ee 2 5.0
2 ff 3 7.0
3 gg 4 9.0
4 hh 5 NaN
df.cumsum()
Example:
data = {'a':[1,6,3,9,5], 'b':[13,1,2,5,23]}
df = pd.DataFrame(data)
df =
a b
0 1 13
1 6 1
2 3 2
3 9 5
4 5 23
df.diff()
a b
0 NaN NaN
1 5.0 -12.0
2 -3.0 1.0
3 6.0 3.0
4 -4.0 18.0
df.cumsum()
a b
0 1 13
1 7 14
2 10 16
3 19 21
4 24 44
If you cannot use rolling, due to a MultiIndex or something else, you can try using .cumsum() and then .diff(2) to subtract the .cumsum() result from two positions before.
data = {'a':[1,6,3,9,5,30, 101, 8]}
df = pd.DataFrame(data)
df['opp_diff'] = df['a'].cumsum().diff(2)
a opp_diff
0 1 NaN
1 6 NaN
2 3 9.0
3 9 12.0
4 5 14.0
5 30 35.0
6 101 131.0
7 8 109.0
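Generally, .cumsum().diff(k) gives the sum of the last k values, so the counterpart of .diff(1) (each row plus the previous row) is .cumsum().diff(2). The drawback is that the first k results are NaN, one more than a rolling sum of window k would produce.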
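To make the difference in leading NaNs concrete, the sketch below compares the cumsum trick with a rolling sum and with a plain shift-and-add on the same data. The shift variant is my own addition, and the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 6, 3, 9, 5, 30, 101, 8])

via_cumsum = s.cumsum().diff(2)    # NaN at positions 0 and 1
via_rolling = s.rolling(2).sum()   # NaN at position 0 only
via_shift = s + s.shift(1)         # row plus previous row, NaN at position 0

print(pd.DataFrame({"a": s,
                    "cumsum_diff": via_cumsum,
                    "rolling": via_rolling,
                    "shift": via_shift}))
```

From position 2 onward all three agree; only the cumsum version loses the value at position 1 (1 + 6 = 7).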
