Pandas dataframe with N columns - python

I need to use Python with Pandas to write a DataFrame with N columns. This is a simplified version of what I have:
import pandas as pd

Ind = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
DAT = pd.DataFrame([Ind[0], Ind[1], Ind[2], Ind[3]], index=None).T
DAT.head()
Out:
   0  1  2   3
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
This is the result that I want, but my real Ind has 121 sets of points and I really don't want to write each one in the DataFrame's argument. Is there a way to write this easily? I tried using a for loop, but that didn't work out.

You can just pass the list directly; pandas builds one row per sub-list, and .T then turns each sub-list into a column:
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
df = pd.DataFrame(data).T
df.head()
Output:
   0  1  2   3
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12
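Since pandas builds one row per sub-list, the same call scales to any number of lists; a minimal sketch with 121 generated sub-lists (the values here are made up for illustration):
import pandas as pd

# 121 sub-lists of 3 points each; values invented for illustration
ind = [[i, i + 10, i + 20] for i in range(121)]
df = pd.DataFrame(ind).T  # one column per sub-list
print(df.shape)  # (3, 121)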

Related

moving last two dataframe rows

I'm trying to move the last two rows up:
import pandas as pd
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "C": [5, 6, 7, 8],
    "D": [9, 10, 11, 12],
    "E": [13, 14, 15, 16],
})
print(df)
Output:
   A  C   D   E
0  1  5   9  13
1  2  6  10  14
2  3  7  11  15
3  4  8  12  16
Desired output:
   A  C   D   E
0  3  7  11  15
1  4  8  12  16
2  1  5   9  13
3  2  6  10  14
I was able to move the last row using
df = df.reindex(np.roll(df.index, shift=1))
But I can't get the second-to-last row to move as well. Any advice on the most efficient way to do this without creating a copy of the dataframe?
Using your code, you can just change the roll's shift value.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "C": [5, 6, 7, 8],
    "D": [9, 10, 11, 12],
    "E": [13, 14, 15, 16],
})
df = df.reindex(np.roll(df.index, shift=2), copy=False)
df.reset_index(inplace=True, drop=True)
print(df)
   A  C   D   E
0  3  7  11  15
1  4  8  12  16
2  1  5   9  13
3  2  6  10  14
The shift value controls how many rows are rotated from the bottom to the top, and afterwards we just reset the index of the dataframe so that it goes back to 0, 1, 2, 3.
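To see why shift=2 works, here is what np.roll does to the index values (it rotates the array, wrapping the trailing elements around to the front):
import numpy as np

idx = np.arange(4)             # the default RangeIndex: [0, 1, 2, 3]
print(np.roll(idx, shift=1))   # [3 0 1 2] -> last row moves to the top
print(np.roll(idx, shift=2))   # [2 3 0 1] -> last two rows move to the top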
Based on the comment about wanting to swap indexes 0 and 1 around, we can use an answer in @CatalinaChou's link to do that. I am choosing to do it after the roll so as to only have to contend with indexes 0 and 1 once everything has been shifted.
# continuing from where the last code fence ends
swap_indexes = {1: 0, 0: 1}
df.rename(swap_indexes, inplace=True)
df.sort_index(inplace=True)
print(df)
   A  C   D   E
0  4  8  12  16
1  3  7  11  15
2  1  5   9  13
3  2  6  10  14
A notable difference is the use of inplace=True, which rules out chaining the methods, but this is to fulfil the requirement of not copying the dataframe at all (or as nearly as possible; I'm not sure whether df.reindex makes an internal copy even with copy=False).
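If copying the dataframe is acceptable after all, the same steps can be written as one method chain; this is only a sketch of the non-in-place alternative, and each step may return a new frame:
out = (
    df.reindex(np.roll(df.index, shift=2))
      .reset_index(drop=True)
      .rename(index={1: 0, 0: 1})  # swap indexes 0 and 1
      .sort_index()
)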

Python use dataframe column value in iloc (or shift)

Although my previous question was answered here: Python dataframe new column with value based on value in other row, I still want to know how to use a column value in iloc (or shift or rolling, etc.).
I have a dataframe with two columns A and B, how do I use the value of column B in iloc? Or shift()?
import pandas as pd

d = {'A': [8, 2, 4, 5, 6, 4, 3, 5, 5, 3], 'B': [2, -1, 4, 5, 0, -3, 8, 2, 6, -1]}
df = pd.DataFrame(data=d)
Using iloc I get this error.
df['C'] = df['A'] * df['A'].iloc[df['B']]
ValueError: cannot reindex from a duplicate axis
Using shift() I get another one.
df['C'] = df['A'] * df['A'].shift(df['B'])
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Is what I want to do possible? If yes, how? If no, why not?
Use numpy indexing:
print(df['A'].to_numpy()[df['B'].to_numpy()])
[4 3 6 4 8 5 5 4 3 3]
df['C'] = df['A'] * df['A'].to_numpy()[df['B'].to_numpy()]
print(df)
   A  B   C
0  8  2  32
1  2 -1   6
2  4  4  24
3  5  5  20
4  6  0  48
5  4 -3  20
6  3  8  15
7  5  2  20
8  5  6  15
9  3 -1   9
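Note that column B contains negative values, and positional NumPy indexing treats those as offsets from the end of the array; that is why row 1 (B = -1) picks up the last value of A, giving C = 2 * 3 = 6:
a = df['A'].to_numpy()
print(a[-1])       # 3, the last element of A
print(a[[2, -1]])  # [4 3], mixing positive and negative positions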
NumPy indexing is the fastest way, I agree, but you can use a list comprehension + iloc too:
d = {'A': [8, 2, 4, 5, 6, 4, 3, 5, 5, 3], 'B': [2, -1, 4, 5, 0, -3, 8, 2, 6, -1]}
df = pd.DataFrame(data=d)
df['C'] = df['A'] * [df['A'].iloc[i] for i in df['B']]
   A  B   C
0  8  2  32
1  2 -1   6
2  4  4  24
3  5  5  20
4  6  0  48
5  4 -3  20
6  3  8  15
7  5  2  20
8  5  6  15
9  3 -1   9

In pandas dataframe - returning last value of cumulative sum that satisfies condition

index   [0, 1, 2, 3, 4, 5]
part_1  [4, 5, 6, 4, 8, 4]
part_2  [11, 12, 10, 12, 14, 13]
new     [6, 4, 8, 8, na, na]
I'm a beginner in Python & pandas asking for support. In a simple dataframe, I want to create a new column that gives me the part_1 value at the last row of a cumulative sum that satisfies the condition
df.part_1.cumsum() > df.part_2
So e.g. for the new column at index 0 I would get the value 6 as (4+5+6) > 11.
Thanks!
IIUC, here is a NumPy-based approach. The idea is to build an upper-triangular matrix holding shifted versions of part_1 in each row. Taking the cumulative sum along each row and comparing against part_2, argmax finds the first index at which the running sum reaches part_2 for the corresponding row (shown here on an extended, eight-row version of the sample data):
import numpy as np

a = df.to_numpy()
# row i holds part_1 with the first i values zeroed out
cs = np.triu(a[:, 1]).cumsum(1)
# first column index at which the running sum reaches part_2
ix = (cs >= a[:, 2, None]).argmax(1)
# array([2, 3, 3, 4, 6, 7, 7, 0], dtype=int64)
df['first_ix'] = a[ix, 1]
print(df)
   index  part_1  part_2  first_ix
0      0       4      11         6
1      1       5      12         4
2      2       6      10         4
3      3       4      12         8
4      4       8      14         6
5      5       4      13         8
6      6       6      11         8
7      7       8      10         4
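One caveat: argmax returns 0 for a row where the condition is never met, so rows whose running sum never reaches part_2 (the na rows in the question's expected output) silently receive row 0's part_1 value. A small mask, sketched here, restores NaN for those rows:
hit = cs >= a[:, 2, None]
# NaN wherever no cumulative sum ever reaches part_2
df['first_ix'] = np.where(hit.any(1), a[hit.argmax(1), 1], np.nan)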

Extract data from a dataframe

I have a list based upon which I want to retrieve data from a dataset.
Here is the list:
packed = [1, 5, 8, 2, 3, 3, 7, 3, 7, 7, 4, 6, 3]
and here is the dataset (shown as an image in the original post; it has a Pid column).
There are two items with multiple quantity, i.e. 3 and 7.
I want to extract those rows which are not in the packed list. In this case it's two 7s (the other three 7s are already in the list).
How can I do that? I tried this, but it doesn't work:
new_df = data[~data["Pid"].isin(packed)].reset_index(drop=True)
Use GroupBy.cumcount to add a helper column, build a helper DataFrame from the list, merge with a left join and indicator=True, and last filter by boolean indexing:
import pandas as pd

packed = [1, 5, 8, 2, 3, 3, 7, 3, 7, 7, 4, 6, 3]
df1 = pd.DataFrame({'Pid': packed})
df1['g'] = df1.groupby('Pid').cumcount()
print(df1)
    Pid  g
0     1  0
1     5  0
2     8  0
3     2  0
4     3  0
5     3  1
6     7  0
7     3  2
8     7  1
9     7  2
10    4  0
11    6  0
12    3  3
data['g'] = data.groupby('Pid').cumcount()
new_df = data[data.merge(df1, indicator=True, how='left')['_merge'].eq('left_only')]
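The dataset itself was only shown as an image, so here is a sketch with a hypothetical stand-in: a Pid column holding the packed values plus two extra 7s. The left-only rows after the merge are exactly those two extras:
import pandas as pd

# hypothetical stand-in for the dataset from the image: five 7s in total
data = pd.DataFrame({'Pid': [1, 5, 8, 2, 3, 3, 7, 3, 7, 7, 4, 6, 3, 7, 7]})

data['g'] = data.groupby('Pid').cumcount()
new_df = data[data.merge(df1, indicator=True, how='left')['_merge'].eq('left_only')]
print(new_df)
    Pid  g
13    7  3
14    7  4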

Finding consecutive segments in a pandas data frame

I have a pandas.DataFrame with measurements taken at consecutive points in time. Along with each measurement the system under observation had a distinct state at each point in time. Hence, the DataFrame also contains a column with the state of the system at each measurement. State changes are much slower than the measurement interval. As a result, the column indicating the states might look like this (index: state):
1: 3
2: 3
3: 3
4: 3
5: 4
6: 4
7: 4
8: 4
9: 1
10: 1
11: 1
12: 1
13: 1
Is there an easy way to retrieve the indices of each segment of consecutively equal states? That means I would like to get something like this:
[[1,2,3,4], [5,6,7,8], [9,10,11,12,13]]
The result could also be something other than plain lists.
The only solution I could think of so far is manually iterating over the rows, finding segment change points and reconstructing the indices from these change points, but I hope there is an easier solution.
One-liner:
df.reset_index().groupby('A')['index'].apply(np.array)
Code for example:
In [1]: import numpy as np
In [2]: from pandas import *
In [3]: df = DataFrame([3]*4+[4]*4+[1]*4, columns=['A'])
In [4]: df
Out[4]:
A
0 3
1 3
2 3
3 3
4 4
5 4
6 4
7 4
8 1
9 1
10 1
11 1
In [5]: df.reset_index().groupby('A')['index'].apply(np.array)
Out[5]:
A
1 [8, 9, 10, 11]
3 [0, 1, 2, 3]
4 [4, 5, 6, 7]
You can also directly access the information from the groupby object:
In [1]: grp = df.groupby('A')
In [2]: grp.indices
Out[2]:
{1L: array([ 8, 9, 10, 11], dtype=int64),
3L: array([0, 1, 2, 3], dtype=int64),
4L: array([4, 5, 6, 7], dtype=int64)}
In [3]: grp.indices[3]
Out[3]: array([0, 1, 2, 3], dtype=int64)
To address the situation that DSM mentioned, you could do something like:
In [1]: df['block'] = (df.A.shift(1) != df.A).astype(int).cumsum()
In [2]: df
Out[2]:
A block
0 3 1
1 3 1
2 3 1
3 3 1
4 4 2
5 4 2
6 4 2
7 4 2
8 1 3
9 1 3
10 1 3
11 1 3
12 3 4
13 3 4
14 3 4
15 3 4
Now group by both columns and apply np.array as before:
In [77]: df.reset_index().groupby(['A','block'])['index'].apply(np.array)
Out[77]:
A  block
1  3        [8, 9, 10, 11]
3  1        [0, 1, 2, 3]
   4        [12, 13, 14, 15]
4  2        [4, 5, 6, 7]
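The block column works because comparing each value with its predecessor yields True exactly where a new segment starts, and the cumulative sum of those booleans numbers the segments. A minimal sketch of the intermediate steps:
import pandas as pd

s = pd.Series([3, 3, 4, 4, 3])
starts = s.shift(1) != s                     # True at positions 0, 2, 4
print(starts.astype(int).cumsum().tolist())  # [1, 1, 2, 2, 3]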
You could use np.diff() to test where a segment starts/ends and iterate over those results. It's a very simple solution, so probably not the most performant one.
import numpy as np

a = np.array([3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 1, 1, 1, 1, 4, 4, 12, 12, 12])
prev = 0
# positions just after each value change, plus one past the end of the array
splits = np.append(np.where(np.diff(a) != 0)[0], len(a) + 1) + 1
for split in splits:
    print(np.arange(1, a.size + 1)[prev:split])
    prev = split
Results in:
[1 2 3 4 5]
[ 6 7 8 9 10]
[11 12 13 14]
[15 16]
[17 18 19]
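For reference, the loop can also be avoided entirely with np.split, which cuts an array at the given positions; a sketch reusing the same change-point logic:
segments = np.split(np.arange(1, a.size + 1), np.where(np.diff(a) != 0)[0] + 1)
print(segments)
# [array([1, 2, 3, 4, 5]), array([ 6,  7,  8,  9, 10]), array([11, 12, 13, 14]),
#  array([15, 16]), array([17, 18, 19])]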
