Python pandas constructing dataframe by looping over columns

I am trying to build a new pandas DataFrame from the data in an existing DataFrame, where each new row also depends on the previously calculated row of the new DataFrame.
As an example, here are two dataframes with the same size.
df1 = pd.DataFrame(np.random.randint(0,10, size = (5, 4)), columns=['1', '2', '3', '4'])
df2 = pd.DataFrame(np.zeros(df1.shape), index=df1.index, columns=df1.columns)
Then I created a list that serves as the starting row of my second DataFrame, df2:
L = [2,5,6,7]
df2.loc[0] = L
Then, for the remaining rows of df2, I want to take the values from the previous row of df2 and add the corresponding row of df1.
for i in df2.loc[1:]:
    df2.ix[i] = df2.ix[i-1] + df1
As an example my dataframes should look like this:
>>> df1
1 2 3 4
0 4 6 0 6
1 7 0 7 9
2 9 1 9 9
3 5 2 3 6
4 0 3 2 9
>>> df2
1 2 3 4
0 2 5 6 7
1 9 5 13 16
2 18 6 22 25
3 23 8 25 31
4 23 11 27 40
I know there is something wrong with the indexing in the for loop, but I cannot figure out how the expression should be written. I would be very thankful for any help on this.

This is a simple cumulative sum: copy df1, replace its first row with the starting values, and call cumsum.
df2 = df1.copy()
df2.loc[0] = [2,5,6,7]
desired_df = df2.cumsum()
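As a quick check, here is a minimal sketch that runs the same three steps on the fixed df1 values printed in the question (instead of random data), so the result can be compared with the expected df2 shown there:

```python
import pandas as pd

# df1 copied from the example output in the question (not random here)
df1 = pd.DataFrame(
    [[4, 6, 0, 6], [7, 0, 7, 9], [9, 1, 9, 9], [5, 2, 3, 6], [0, 3, 2, 9]],
    columns=['1', '2', '3', '4'],
)

df2 = df1.copy()
df2.loc[0] = [2, 5, 6, 7]      # starting row replaces the first row of df1
desired_df = df2.cumsum()      # each row = previous result + current df1 row
print(desired_df)
```

Because the first row is overwritten before summing, every later row is the running total of the starting values plus the df1 rows seen so far.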

Related

How to stack two columns of a pandas dataframe in python

I want to stack two columns on top of each other
So I have Left and Right values in one column each, and want to combine them into a single one. How do I do this in Python?
I'm working with Pandas Dataframes.
Basically from this
Left Right
0 20 25
1 15 18
2 10 35
3 0 5
To this:
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
It doesn't matter how they are combined as I will plot it anyway, and the new column name also doesn't matter because I can rename it.
You can build a list of the columns and pass it to concat. Calling squeeze on each column ensures you are working with plain Series, so concat stacks them end to end rather than trying to align on column names. Passing ignore_index=True creates a new index; otherwise the original index values are repeated:
cols = [df[col].squeeze() for col in df]
pd.concat(cols, ignore_index=True)
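For reference, a self-contained sketch of the concat approach on the Left/Right data from the question. The name `stacked` and the final `to_frame` call are additions for illustration, and squeeze is left out here because each `df[col]` is already a Series:

```python
import pandas as pd

df = pd.DataFrame({"Left": [20, 15, 10, 0], "Right": [25, 18, 35, 5]})

# Iterating over a DataFrame yields its column labels
cols = [df[col] for col in df]

# Stack the columns end to end and build a fresh 0..n-1 index
stacked = pd.concat(cols, ignore_index=True).to_frame("New Name")
print(stacked)
```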
Many options, stack, melt, concat, ...
Here's one:
>>> df.melt(value_name='New Name').drop(columns='variable')
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
You can also use np.ravel:
import numpy as np
out = pd.DataFrame(np.ravel(df.values.T), columns=['New name'])
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Update
If you have only 2 cols:
out = pd.concat([df['Left'], df['Right']], ignore_index=True).to_frame('New name')
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Solution with unstack
df2 = df.unstack()
# recreate index
df2.index = np.arange(len(df2))
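A solution with column selection and np.ravel. Note that the values come out interleaved row by row (Left, Right, Left, ...), which is fine here because the order does not matter.
# Your data
import numpy as np
import pandas as pd
df = pd.DataFrame({"Left":[20,15,10,0], "Right":[25,18,35,5]})
# Select the columns to ravel
df2 = pd.DataFrame({"New Name":np.ravel(df[["Left","Right"]])})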
df2
New Name
0 20
1 25
2 15
3 18
4 10
5 35
6 0
7 5
I ended up using this solution, seems to work fine
df1 = dfTest[['Left']].copy()
df2 = dfTest[['Right']].copy()
df2.columns=['Left']
df3 = pd.concat([df1, df2],ignore_index=True)

Splitting the total time (in seconds) and fill the rows of a column value in 1 second frame

I have a dataframe that looks like this (start_time and stop_time are in seconds, followed by milliseconds):
And my expected output looks like this:
I don't know how to approach this. Forward filling may fill NaN values, but I need the total time to be divided into 1-second frames, each saved with its respective label. I don't have any code snippet to go forward with. All I did was save the data in a dataframe:
df = pd.DataFrame(data, columns=['Labels', 'start_time', 'stop_time'])
Thank you and I really appreciate the help.
>>> df2 = pd.DataFrame({
... "Labels" : df.apply(lambda x:[x.Labels]*(round(x.stop_time)-round(x.start_time)), axis=1).explode(),
... "start_time" : df.apply(lambda x:range(round(x.start_time), round(x.stop_time)), axis=1).explode()
... })
>>> df2['stop_time'] = df2.start_time + 1
>>> df2
Labels start_time stop_time
0 A 0 1
0 A 1 2
0 A 2 3
0 A 3 4
0 A 4 5
0 A 5 6
0 A 6 7
0 A 7 8
0 A 8 9
1 B 9 10
1 B 10 11
1 B 11 12
1 B 12 13
2 C 13 14
2 C 14 15
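The same expansion can also be sketched with a plain comprehension over the rows. The input values below are hypothetical (the question's original table is not available); they were chosen only so that the result has the same shape as the output above:

```python
import pandas as pd

# Hypothetical input: one row per label with start/stop times in seconds
df = pd.DataFrame({
    "Labels": ["A", "B", "C"],
    "start_time": [0.0, 9.0, 13.0],
    "stop_time": [9.0, 13.0, 15.0],
})

# Emit one (label, start, stop) record per whole second in each interval
rows = [
    (row.Labels, t, t + 1)
    for row in df.itertuples(index=False)
    for t in range(int(round(row.start_time)), int(round(row.stop_time)))
]
out = pd.DataFrame(rows, columns=["Labels", "start_time", "stop_time"])
print(out)
```

Unlike the apply/explode version, this produces a fresh 0..n-1 index rather than repeating the original row index.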

Pandas Transform

Let's say I have the following dataframe:
metric value last_1_day last_7_day
points 10 3 9
assists 15 2 12
rebounds 12 1 5
I want to transpose this and concatenate the values within metric with the other column names. So the result would have 1 row and 9 columns. For example:
points_value points_last_1_day points_last_7_day assists_value assists_last_1_day assists_last_7_day rebounds_value rebounds_last_1_day rebounds_last_7_day
10 3 9 15 2 12 12 1 5
What is the best way to do this in pandas?
Try with:
df = df.set_index('metric').stack().to_frame().T
df.columns = ['_'.join(a) for a in df.columns]
Output:
points_value points_last_1_day points_last_7_day assists_value assists_last_1_day assists_last_7_day rebounds_value rebounds_last_1_day rebounds_last_7_day
0 10 3 9 15 2 12 12 1 5
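A self-contained sketch of the same steps, with the sample table from the question typed in by hand. The result is stored under a new name, `wide` (an illustrative choice), so the original frame is not overwritten:

```python
import pandas as pd

df = pd.DataFrame({
    "metric": ["points", "assists", "rebounds"],
    "value": [10, 15, 12],
    "last_1_day": [3, 2, 1],
    "last_7_day": [9, 12, 5],
})

# stack() gives a Series indexed by (metric, column); transpose it to one row
wide = df.set_index("metric").stack().to_frame().T

# Flatten the (metric, column) pairs into single names such as points_value
wide.columns = ["_".join(pair) for pair in wide.columns]
print(wide)
```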

Retrieving Unknown Column Names from DataFrame.apply

How can I retrieve column names from a call to DataFrame.apply without knowing them in advance?
What I'm trying to do is apply a mapping from column names to functions to arbitrary DataFrames. Those functions might return multiple columns. I would like to end up with a DataFrame that contains the original columns as well as the new ones, the number and names of which I don't know at build time.
Other solutions here are Series-based. I'd like to do the whole frame at once, if possible.
What am I missing here? Are the columns coming back from apply lost in destructuring unless I know their names? It looks like assign might be useful, but will likely require a lot of boilerplate.
import pandas as pd
def fxn(col):
    return pd.Series(col * 2, name=col.name+'2')
df = pd.DataFrame({'A': range(0, 10), 'B': range(10, 0, -1)})
print(df)
# [Edit:]
# A B
# 0 0 10
# 1 1 9
# 2 2 8
# 3 3 7
# 4 4 6
# 5 5 5
# 6 6 4
# 7 7 3
# 8 8 2
# 9 9 1
df = df.apply(fxn)
print(df)
# [Edit:]
# Observed: columns changed in-place.
# A B
# 0 0 20
# 1 2 18
# 2 4 16
# 3 6 14
# 4 8 12
# 5 10 10
# 6 12 8
# 7 14 6
# 8 16 4
# 9 18 2
df[['A2', 'B2']] = df.apply(fxn)
print(df)
# [Edit: I am doubling column values, so missing something, but the question about the column counts stands.]
# Expected: new columns added. How can I do this at runtime without knowing column names?
# A B A2 B2
# 0 0 40 0 80
# 1 4 36 8 72
# 2 8 32 16 64
# 3 12 28 24 56
# 4 16 24 32 48
# 5 20 20 40 40
# 6 24 16 48 32
# 7 28 12 56 24
# 8 32 8 64 16
# 9 36 4 72 8
You need to concat the result of your function with the original df.
Use pd.concat:
In [8]: x = df.apply(fxn) # Apply function on df and store result separately
In [10]: df = pd.concat([df, x], axis=1) # Concat with original df to get all columns
Rename duplicate column names by adding suffixes:
In [82]: from collections import Counter
In [38]: mylist = df.columns.tolist()
In [41]: d = {a:list(range(1, b+1)) if b>1 else '' for a,b in Counter(mylist).items()}
In [62]: df.columns = [i+str(d[i].pop(0)) if len(d[i]) else i for i in mylist]
In [63]: df
Out[63]:
A1 B1 A2 B2
0 0 10 0 20
1 1 9 2 18
2 2 8 4 16
3 3 7 6 14
4 4 6 8 12
5 5 5 10 10
6 6 4 12 8
7 7 3 14 6
8 8 2 16 4
9 9 1 18 2
You can assign directly with:
df[df.columns + '2'] = df.apply(fxn)
Output:
A B A2 B2
0 0 10 0 20
1 1 9 2 18
2 2 8 4 16
3 3 7 6 14
4 4 6 8 12
5 5 5 10 10
6 6 4 12 8
7 7 3 14 6
8 8 2 16 4
9 9 1 18 2
Alternatively, you can build on @MayankPorwal's answer by applying .add_suffix('2') to the output of your apply call:
pd.concat([df, df.apply(fxn).add_suffix('2')], axis=1)
which will return the same output.
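Put together as a runnable sketch on the question's sample frame. The helper name `double` and the variable `result` are illustrative; the new column names come entirely from add_suffix, so they never need to be listed by hand:

```python
import pandas as pd

def double(col):
    # Stand-in for the question's fxn: returns a transformed column
    return col * 2

df = pd.DataFrame({"A": range(0, 10), "B": range(10, 0, -1)})

# apply keeps the original labels, so rename them before concatenating
doubled = df.apply(double).add_suffix("2")
result = pd.concat([df, doubled], axis=1)
print(result)
```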
In your function, name=col.name+'2' has no effect (the function effectively returns just col * 2). That's because apply labels each result with the original column name and ignores the name of the returned Series.
Anyways, it's possible to take the MayankPorwal approach: pd.concat + managing duplicated columns (make them unique). Another possible way to do that:
# Use pd.concat as mentioned in the first answer from Mayank Porwal
df = pd.concat([df, df.apply(fxn)], axis=1)
# Rename duplicated columns
suffix = (pd.Series(df.columns).groupby(df.columns).cumcount()+1).astype(str)
df.columns = df.columns + suffix.replace('1', '')
which returns the same output, and additionally handles further duplicated columns.
Answer on behalf of the OP:
This code does what I wanted:
import pandas as pd
# Simulated business logic: for an input row, return a number of columns
# related to the input, and generate names for them, such that we don't
# know the shape of the output or the names of its columns before the call.
def fxn(row):
    length = row.iloc[0]
    indices = [row.index[0] + str(i) for i in range(0, length)]
    series = pd.Series([i for i in range(0, length)], index=indices)
    return series
# Sample data: 0 to 18, inclusive, counting by 2.
df1 = pd.DataFrame(list(range(0, 20, 2)), columns=['A'])
# Randomize the rows to simulate different input shapes.
df1 = df1.sample(frac=1)
# Apply fxn to rows to get new columns (with expand). Concat to keep inputs.
df1 = pd.concat([df1, df1.apply(fxn, axis=1, result_type='expand')], axis=1)
print(df1)

Splitting a dataframe with python

What I want to do is pretty simple, in other languages. I want to split a table, using a "for" loop to split a data frame every fifth row.
The idea is that I have dataframe that adds a new row, every so often, like answering a form with different questions and every answer is added to a specific column, like Google Forms with SpreadSheet.
What I have tried is the following:
import pandas as pd
dp=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
df1=pd.DataFrame(data=dp)
for i in range(0, len(dp)):
    if i%5==0:
        df = df1.iloc[i,:]
        print(df)
print(df)
Which I know isn't much but nevertheless it is a try. Now, what I can't do is create a new variable with the new dataframe every time the loop reaches the i mod 5 == 0 row.
I think you're trying to convert a flat list into rows and columns using a known number of fields.
I'd do something like this:
import numpy as np
import pandas as pd
numFields = 3 # this is five in your case
fieldNames = ['color', 'animal', 'amphibian'] # totally optional
# this is your 'dp'
inputData = ['brown', 'dog','false','green', 'toad','true']
flatDataArray = np.asarray(inputData)
reshapedData = flatDataArray.reshape(-1, numFields)
df = pd.DataFrame(reshapedData, columns=fieldNames) # you only need 'columns' if you want to name fields
print(df)
which gives:
color animal amphibian
0 brown dog false
1 green toad true
--UPDATE--
From your comment above, I see that you'd like an arbitrary number of dataframes, one for each five-row group. Why not create a list of dataframes (i.e. so you have dfs[0], dfs[1])?
# continuing from where the previous code left off...
dfs = []
for group in reshapedData:
    dfs.append(pd.DataFrame(group))
for df in dfs:
    print(df)
which prints:
0
0 brown
1 dog
2 false
0
0 green
1 toad
2 true
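You can also use numpy.split, passing the row positions at which to split. The first call below splits at positions 1, 6 and 11; the second splits at 5, 10 and 15: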
lod = np.split(df1, np.arange(1, 16, 5))
print(*lod, sep='\n\n')
0
0 0
0
1 1
2 2
3 3
4 4
5 5
0
6 6
7 7
8 8
9 9
10 10
0
11 11
12 12
13 13
14 14
15 15
lod = np.split(df1, np.arange(0, 16, 5)[1:])
print(*lod, sep='\n\n')
0
0 0
1 1
2 2
3 3
4 4
0
5 5
6 6
7 7
8 8
9 9
0
10 10
11 11
12 12
13 13
14 14
0
15 15
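An equivalent list of chunks can be sketched with plain iloc slicing, which does not depend on how NumPy handles DataFrame inputs. The names `chunk_size` and `chunks` are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame(data=list(range(16)))
chunk_size = 5

# One DataFrame per group of five rows; the last chunk may be shorter
chunks = [
    df1.iloc[start:start + chunk_size]
    for start in range(0, len(df1), chunk_size)
]

for chunk in chunks:
    print(chunk, end="\n\n")
```

Each element of `chunks` keeps the original row labels, so `chunks[1]` covers rows 5 through 9.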
