Splitting a dataframe with python

What I want to do is pretty simple in other languages: I want to split a table, using a "for" loop to cut a data frame every fifth row.
The idea is that I have a dataframe that gains a new row every so often, like a form where each answer is added to a specific column (as with Google Forms and a spreadsheet).
What I have tried is the following:
import pandas as pd
dp=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
df1=pd.DataFrame(data=dp)
for i in range(0, len(dp)):
    if i%5==0:
        df = df1.iloc[i,:]
        print(df)
print(df)
Which I know isn't much, but it is a try. What I can't do is create a new variable holding a new dataframe every time the loop reaches a row where i mod 5 == 0.

I think you're trying to convert a flat list into rows and columns using a known number of fields.
I'd do something like this:
import numpy as np
import pandas as pd
numFields = 3 # this is five in your case
fieldNames = ['color', 'animal', 'amphibian'] # totally optional
# this is your 'dp'
inputData = ['brown', 'dog','false','green', 'toad','true']
flatDataArray = np.asarray(inputData)
reshapedData = flatDataArray.reshape(-1, numFields)
df = pd.DataFrame(reshapedData, columns=fieldNames) # you only need 'columns' if you want to name fields
print(df)
which gives:
   color animal amphibian
0  brown    dog     false
1  green   toad      true
--UPDATE--
From your comment above, I see that you'd like an arbitrary number of dataframes, one for each five-row group. Why not create a list of dataframes (i.e. so you have dfs[0], dfs[1])?
# continuing from where the previous code left off...
dfs = []
for group in reshapedData:
    dfs.append(pd.DataFrame(group))
for df in dfs:
    print(df)
which prints:
       0
0  brown
1    dog
2  false
       0
0  green
1   toad
2   true
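If the goal is to cut an existing dataframe into pieces of five rows each, the same "list of dataframes" idea can be sketched with plain positional slicing. This is only a minimal sketch: the names chunk_size and dfs are illustrative, and df1 here is a stand-in for the frame in the question.

```python
import pandas as pd

# Stand-in for the question's df1: 16 rows, one column.
df1 = pd.DataFrame({"value": range(16)})

chunk_size = 5  # rows per piece (illustrative name)

# Slice consecutive blocks of chunk_size rows; the last block may be shorter.
dfs = [
    df1.iloc[start:start + chunk_size]
    for start in range(0, len(df1), chunk_size)
]

for piece in dfs:
    print(piece, end="\n\n")
```

Each element of dfs keeps the original row labels, so dfs[1] holds rows 5 to 9.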

numpy.split
# Split before rows 1, 6 and 11: one single-row frame, then groups of five
lod = np.split(df1, np.arange(1, 16, 5))
print(*lod, sep='\n\n')
0
0 0
0
1 1
2 2
3 3
4 4
5 5
0
6 6
7 7
8 8
9 9
10 10
0
11 11
12 12
13 13
14 14
15 15
# Split before rows 5, 10 and 15 instead, so the groups of five start at row 0
lod = np.split(df1, np.arange(0, 16, 5)[1:])
print(*lod, sep='\n\n')
0
0 0
1 1
2 2
3 3
4 4
0
5 5
6 6
7 7
8 8
9 9
0
10 10
11 11
12 12
13 13
14 14
0
15 15
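A possible alternative that stays within pandas is to group rows by their position divided by five. The sketch below assumes the same 16-row df1 as in the question; group_ids and pieces are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(list(range(16)))

# Rows 0-4 get id 0, rows 5-9 get id 1, and so on.
group_ids = np.arange(len(df1)) // 5

# Map each group id to its sub-frame.
pieces = {key: part for key, part in df1.groupby(group_ids)}

for key, part in pieces.items():
    print(f"group {key}")
    print(part, end="\n\n")
```

Keeping the pieces in a dict (or list) avoids having to create a separately named variable for every group.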

Related

How to stack two columns of a pandas dataframe in python

I want to stack two columns on top of each other
So I have Left and Right values in one column each, and want to combine them into a single one. How do I do this in Python?
I'm working with Pandas Dataframes.
Basically from this
Left Right
0 20 25
1 15 18
2 10 35
3 0 5
To this:
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
It doesn't matter how they are combined as I will plot it anyway, and the new column name also doesn't matter because I can rename it.
You can build a list of the columns, calling squeeze on each so that concat does not try to align on column names, and then call concat on this list. Passing ignore_index=True creates a new index; otherwise the original index values are repeated:
cols = [df[col].squeeze() for col in df]
pd.concat(cols, ignore_index=True)
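As a self-contained sketch of this approach, using the sample values from the question (the variable name stacked is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Left": [20, 15, 10, 0], "Right": [25, 18, 35, 5]})

# One Series per column, in column order.
cols = [df[col] for col in df]

# Stack them end to end and renumber the index from 0.
stacked = pd.concat(cols, ignore_index=True).to_frame("New Name")
print(stacked)
```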
Many options, stack, melt, concat, ...
Here's one:
>>> df.melt(value_name='New Name').drop(columns='variable')
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
You can also use np.ravel:
import numpy as np
out = pd.DataFrame(np.ravel(df.values.T), columns=['New name'])
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Update
If you have only 2 cols:
out = pd.concat([df['Left'], df['Right']], ignore_index=True).to_frame('New name')
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Solution with unstack
df2 = df.unstack()
# recreate index
df2.index = np.arange(len(df2))
A solution with masking.
# Your data
import numpy as np
import pandas as pd
df = pd.DataFrame({"Left":[20,15,10,0], "Right":[25,18,35,5]})
# Select the two columns and flatten them row by row
df2 = pd.DataFrame({"New Name":np.ravel(df[["Left","Right"]])})
df2
New Name
0 20
1 25
2 15
3 18
4 10
5 35
6 0
7 5
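Note that this output interleaves Left and Right values, because np.ravel reads the data row by row by default. If the column-after-column order from the question is preferred, one option is to pass order="F" to np.ravel. A small sketch, with illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Left": [20, 15, 10, 0], "Right": [25, 18, 35, 5]})
values = df[["Left", "Right"]].to_numpy()

# Default (row-major) order: Left, Right, Left, Right, ...
interleaved = np.ravel(values)

# Column-major order: all Left values first, then all Right values.
stacked = np.ravel(values, order="F")

df2 = pd.DataFrame({"New Name": stacked})
print(df2)
```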
I ended up using this solution, seems to work fine
df1 = dfTest[['Left']].copy()
df2 = dfTest[['Right']].copy()
df2.columns=['Left']
df3 = pd.concat([df1, df2],ignore_index=True)

Splitting the total time (in seconds) and fill the rows of a column value in 1 second frame

I have a dataframe that looks like this (start_time and stop_time are in seconds, followed by milliseconds):
And my expected output looks like this:
I don't know how to approach this. Forward filling may fill NaN values, but I need the total time in seconds to be divided up and saved as 1-second frames with their respective labels. I don't have any code snippet to go forward. All I did is save it in a dataframe as:
df = pd.DataFrame(data, columns=['Labels', 'start_time', 'stop_time'])
Thank you and I really appreciate the help.
>>> df2 = pd.DataFrame({
...     "Labels" : df.apply(lambda x:[x.Labels]*(round(x.stop_time)-round(x.start_time)), axis=1).explode(),
...     "start_time" : df.apply(lambda x:range(round(x.start_time), round(x.stop_time)), axis=1).explode()
... })
>>> df2['stop_time'] = df2.start_time + 1
>>> df2
Labels start_time stop_time
0 A 0 1
0 A 1 2
0 A 2 3
0 A 3 4
0 A 4 5
0 A 5 6
0 A 6 7
0 A 7 8
0 A 8 9
1 B 9 10
1 B 10 11
1 B 11 12
1 B 12 13
2 C 13 14
2 C 14 15
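The same expansion can be sketched without apply, by repeating each label once per whole second. The input values below are hypothetical, since the original table was not preserved; they were chosen only so that they round to the boundaries shown in the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical input: the question's actual data is not available.
df = pd.DataFrame({
    "Labels": ["A", "B", "C"],
    "start_time": [0.0, 9.2, 13.4],
    "stop_time": [9.2, 13.4, 15.1],
})

# Round the boundaries to whole seconds.
start = df["start_time"].round().astype(int)
stop = df["stop_time"].round().astype(int)
n_seconds = (stop - start).to_numpy()

out = pd.DataFrame({
    # Repeat each label once per one-second frame it covers.
    "Labels": np.repeat(df["Labels"].to_numpy(), n_seconds),
    # Enumerate the start of every one-second frame.
    "start_time": np.concatenate(
        [np.arange(a, b) for a, b in zip(start, stop)]
    ),
})
out["stop_time"] = out["start_time"] + 1
print(out)
```

Unlike the explode-based version, this gives a fresh 0..n-1 index rather than repeating the original row labels.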

Dataframe.loc does not work on multiple index, very simple

I have a pandas data frame that looks like this:
As you can see, I have the index levels "State" and "City".
And I want to filter by state using loc, for example:
nuevo4.loc["Bulgaria"]
(The name of the dataframe is "nuevo4".) But instead of getting the results I want, I get the error:
KeyError: 'Bulgaria'
I read the loc documentation online and I cannot see what is wrong. I'm sorry if this is too obvious; the names are spelled correctly, and so on.
Your command should work; see the example below. You might have some whitespace issues in your data:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(25).reshape(-1,5), index = pd.MultiIndex.from_tuples([('A',1), ('A',2),('B', 0), ('C', 1), ('C', 2)]))
print(df)
print(df.loc['A'])
Output:
      0   1   2   3   4
A 1   0   1   2   3   4
  2   5   6   7   8   9
B 0  10  11  12  13  14
C 1  15  16  17  18  19
  2  20  21  22  23  24
Using loc:
   0  1  2  3  4
1  0  1  2  3  4
2  5  6  7  8  9
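If whitespace is the culprit, one way to check and repair it might look like the sketch below. The frame here is a made-up stand-in for nuevo4 (the city names and values are hypothetical), with a leading space deliberately placed in one state label.

```python
import pandas as pd

# Hypothetical stand-in for nuevo4, with a stray leading space in " Bulgaria".
index = pd.MultiIndex.from_tuples(
    [(" Bulgaria", "Sofia"), (" Bulgaria", "Varna"), ("Spain", "Madrid")],
    names=["State", "City"],
)
nuevo4 = pd.DataFrame({"value": [1, 2, 3]}, index=index)

# Printing the labels as a list makes stray spaces visible.
print(nuevo4.index.get_level_values("State").unique().tolist())

# Strip whitespace from the labels of the "State" level only.
cleaned = nuevo4.rename(index=str.strip, level="State")
print(cleaned.loc["Bulgaria"])
```

With the labels stripped, the lookup by state name no longer raises KeyError.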

Retrieving Unknown Column Names from DataFrame.apply

How I can retrieve column names from a call to DataFrame apply without knowing them in advance?
What I'm trying to do is apply a mapping from column names to functions to arbitrary DataFrames. Those functions might return multiple columns. I would like to end up with a DataFrame that contains the original columns as well as the new ones, the amount and names of which I don't know at build-time.
Other solutions here are Series-based. I'd like to do the whole frame at once, if possible.
What am I missing here? Are the columns coming back from apply lost in destructuring unless I know their names? It looks like assign might be useful, but will likely require a lot of boilerplate.
import pandas as pd
def fxn(col):
    return pd.Series(col * 2, name=col.name+'2')
df = pd.DataFrame({'A': range(0, 10), 'B': range(10, 0, -1)})
print(df)
# [Edit:]
# A B
# 0 0 10
# 1 1 9
# 2 2 8
# 3 3 7
# 4 4 6
# 5 5 5
# 6 6 4
# 7 7 3
# 8 8 2
# 9 9 1
df = df.apply(fxn)
print(df)
# [Edit:]
# Observed: columns changed in-place.
# A B
# 0 0 20
# 1 2 18
# 2 4 16
# 3 6 14
# 4 8 12
# 5 10 10
# 6 12 8
# 7 14 6
# 8 16 4
# 9 18 2
df[['A2', 'B2']] = df.apply(fxn)
print(df)
# [Edit: I am doubling column values, so missing something, but the question about the column counts stands.]
# Expected: new columns added. How can I do this at runtime without knowing column names?
# A B A2 B2
# 0 0 40 0 80
# 1 4 36 8 72
# 2 8 32 16 64
# 3 12 28 24 56
# 4 16 24 32 48
# 5 20 20 40 40
# 6 24 16 48 32
# 7 28 12 56 24
# 8 32 8 64 16
# 9 36 4 72 8
You need to concat the result of your function with the original df.
Use pd.concat:
In [8]: x = df.apply(fxn) # Apply function on df and store result separately
In [10]: df = pd.concat([df, x], axis=1) # Concat with original df to get all columns
Rename duplicate column names by adding suffixes:
In [82]: from collections import Counter
In [38]: mylist = df.columns.tolist()
In [41]: d = {a:list(range(1, b+1)) if b>1 else '' for a,b in Counter(mylist).items()}
In [62]: df.columns = [i+str(d[i].pop(0)) if len(d[i]) else i for i in mylist]
In [63]: df
Out[63]:
A1 B1 A2 B2
0 0 10 0 20
1 1 9 2 18
2 2 8 4 16
3 3 7 6 14
4 4 6 8 12
5 5 5 10 10
6 6 4 12 8
7 7 3 14 6
8 8 2 16 4
9 9 1 18 2
You can assign directly with:
df[df.columns + '2'] = df.apply(fxn)
Output:
A B A2 B2
0 0 10 0 20
1 1 9 2 18
2 2 8 4 16
3 3 7 6 14
4 4 6 8 12
5 5 5 10 10
6 6 4 12 8
7 7 3 14 6
8 8 2 16 4
9 9 1 18 2
Alternatively, you can build on @MayankPorwal's answer by applying .add_suffix('2') to the output of your apply function:
pd.concat([df, df.apply(fxn).add_suffix('2')], axis=1)
which will return the same output.
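Put together as a runnable sketch, starting from the question's original frame (new_cols and result are illustrative names, and fxn is simplified to return just the doubled column):

```python
import pandas as pd

def fxn(col):
    # Simplified version of the question's function: double every value.
    return col * 2

df = pd.DataFrame({"A": range(0, 10), "B": range(10, 0, -1)})

# apply keeps the original column names, so rename them before joining.
new_cols = df.apply(fxn).add_suffix("2")

# Place the new columns next to the original ones.
result = pd.concat([df, new_cols], axis=1)
print(result)
```

Because the suffix is added to whatever column names apply returns, the names do not need to be known in advance.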
In your function, name=col.name+'2' has no effect (the function effectively returns just col * 2). That's because apply labels each result with the name of the original column.
Anyway, it's possible to take @MayankPorwal's approach: pd.concat plus making the duplicated column names unique. Another possible way to do that:
# Use pd.concat as mentioned in the first answer from Mayank Porwal
df = pd.concat([df, df.apply(fxn)], axis=1)
# Rename duplicated columns
suffix = (pd.Series(df.columns).groupby(df.columns).cumcount()+1).astype(str)
df.columns = df.columns + suffix.replace('1', '')
which returns the same output, and also handles any further duplicated columns.
Answer on behalf of the OP:
This code does what I wanted:
import pandas as pd
# Simulated business logic: for an input row, return a number of columns
# related to the input, and generate names for them, such that we don't
# know the shape of the output or the names of its columns before the call.
def fxn(row):
    length = row.iloc[0]  # first value of the row, by position
    indices = [row.index[0] + str(i) for i in range(0, length)]
    series = pd.Series([i for i in range(0, length)], index=indices)
    return series
# Sample data: 0 to 18, inclusive, counting by 2.
df1 = pd.DataFrame(list(range(0, 20, 2)), columns=['A'])
# Randomize the rows to simulate different input shapes.
df1 = df1.sample(frac=1)
# Apply fxn to rows to get new columns (with expand). Concat to keep inputs.
df1 = pd.concat([df1, df1.apply(fxn, axis=1, result_type='expand')], axis=1)
print(df1)

Python pandas constructing dataframe by looping over columns

I am trying to build a new pandas dataframe based on data from an existing dataframe, while taking into account the previously calculated value in the new dataframe.
As an example, here are two dataframes with the same size.
df1 = pd.DataFrame(np.random.randint(0,10, size = (5, 4)), columns=['1', '2', '3', '4'])
df2 = pd.DataFrame(np.zeros(df1.shape), index=df1.index, columns=df1.columns)
Then I created a list which serves as the starting row for my second dataframe, df2:
L = [2,5,6,7]
df2.loc[0] = L
Then for the remaining rows of df2 I want to take the value from the previous time step (df2) and add the value of df1.
for i in df2.loc[1:]:
    df2.ix[i] = df2.ix[i-1] + df1
As an example my dataframes should look like this:
>>> df1
1 2 3 4
0 4 6 0 6
1 7 0 7 9
2 9 1 9 9
3 5 2 3 6
4 0 3 2 9
>>> df2
1 2 3 4
0 2 5 6 7
1 9 5 13 16
2 18 6 22 25
3 23 8 25 31
4 23 11 27 40
I know there is something wrong with the indication of indexes in the for loop but I cannot figure out how the argument must be formulated. I would be very thankful for any help on this.
This is a simple cumulative sum (cumsum): copy df1, replace its first row with the starting values, and sum down the rows.
df2 = df1.copy()
df2.loc[0] = [2,5,6,7]
desired_df = df2.cumsum()
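To see why cumsum matches the loop described in the question, here is a sketch that computes the result both ways on the example values of df1 shown above. The names start, looped and vectorised are illustrative.

```python
import pandas as pd

# The example df1 from the question.
df1 = pd.DataFrame(
    [[4, 6, 0, 6], [7, 0, 7, 9], [9, 1, 9, 9], [5, 2, 3, 6], [0, 3, 2, 9]],
    columns=["1", "2", "3", "4"],
)
start = [2, 5, 6, 7]

# Explicit loop: each row is the previous result plus the matching row of df1.
rows = [start]
for i in range(1, len(df1)):
    rows.append([prev + step for prev, step in zip(rows[-1], df1.iloc[i])])
looped = pd.DataFrame(rows, columns=df1.columns)

# Vectorised equivalent: swap in the starting row, then sum down the rows.
df2 = df1.copy()
df2.loc[0] = start
vectorised = df2.cumsum()

print(vectorised)
print(vectorised.values.tolist() == looped.values.tolist())
```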
