I am trying to add an empty column as the 3rd column of my data frame, which contains 5 columns. Example:
Fname,Lname,city,state,zip
mike,smith,new york,ny,11101
This is what I have and below I am going to show what I want it to look like.
Fname,Lname,new column,city,state,zip
mike,smith,,new york,ny,11101
I don't want to populate that column with data. All I want to do is add the extra column to the header, so that each data row has a blank field, i.e. ',,'.
I've seen examples where a new column is added to the end of a data frame, but not at a specific position.
You should use
df.insert(loc, column, value)
where loc is the integer position at which to insert, column is the new column's name, and value is its contents.
For an empty column:
df.insert(loc=2, column='new col', value=['' for i in range(df.shape[0])])
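As a quick check against the data in the question, here is a minimal sketch. The column name 'new column' is taken from the desired output, and passing the scalar '' (which pandas broadcasts to every row) assumes an empty string is an acceptable placeholder.

```python
import pandas as pd

# The single sample row from the question.
df = pd.DataFrame(
    {
        "Fname": ["mike"],
        "Lname": ["smith"],
        "city": ["new york"],
        "state": ["ny"],
        "zip": [11101],
    }
)

# loc=2 places the new column third (positions are zero-based).
# insert() modifies df in place and returns None.
df.insert(loc=2, column="new column", value="")

# Writing to CSV shows the empty field as two adjacent commas.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Because insert() works in place, assigning its return value back to df would replace the frame with None.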
Use reindex or column filtering
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(50).reshape(10,-1), columns=[*'ABCDE'])
df['z']= np.nan
df[['A','z','B','C','D','E']]
OR
df.reindex(['A','z','B','C','D','E'], axis=1)
Output:
A z B C D E
0 0 NaN 1 2 3 4
1 5 NaN 6 7 8 9
2 10 NaN 11 12 13 14
3 15 NaN 16 17 18 19
4 20 NaN 21 22 23 24
5 25 NaN 26 27 28 29
6 30 NaN 31 32 33 34
7 35 NaN 36 37 38 39
8 40 NaN 41 42 43 44
9 45 NaN 46 47 48 49
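The column order does not have to be typed out by hand. A minimal sketch, assuming the same 10x5 test frame as above: build the order from the existing columns, then let reindex create the missing column, which it fills with NaN. The name 'z' and position 1 are just the values used in this answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(50).reshape(10, -1), columns=list("ABCDE"))

# Build the target column order from the existing columns.
cols = list(df.columns)
cols.insert(1, "z")  # position 1 = directly after 'A'

# reindex adds any label that does not exist yet as an all-NaN column,
# so a separate df['z'] = np.nan step is not needed here.
out = df.reindex(columns=cols)
print(out.head())
```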
You can simply go for df.insert()
import pandas as pd
data = {'Fname': ['mike'],
'Lname': ['smith'],
'city': ['new york'],
'state': ['ny'],
'zip': [11101]}
df = pd.DataFrame(data)
df.insert(1, "Address", '', True)
print(df)
Output:
Fname Address Lname city state zip
0 mike smith new york ny 11101
In a pandas dataframe, I want to create a new column that calculates the average of column values of 4th, 8th and 12th row before our present row.
As shown in the table below, for row number 13 :
Value in Existing column that is 4 rows before row 13 (row 9) = 4
Value in Existing column that is 8 rows before row 13 (row 5) = 6
Value in Existing column that is 12 rows before row 13 (row 1) = 2
The average of 4, 6 and 2 is 4. Hence New Column = 4 at row number 13; for rows 1-12, New Column = NaN.
I have more rows in my df, but I added only first 13 rows here for illustration.
Row number  Existing column  New column
1           2                NaN
2           4                NaN
3           3                NaN
4           1                NaN
5           6                NaN
6           4                NaN
7           8                NaN
8           2                NaN
9           4                NaN
10          9                NaN
11          2                NaN
12          4                NaN
13          3                3
.shift() is your missing part. We can use it to access previous rows from the existing row in a Pandas dataframe.
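Let's use .groupby(), .apply() and .shift() as follows (group_keys=False keeps the original row index on the result in recent pandas versions, so it can be assigned back as a column):
df['New column'] = df.groupby((df['Row number'] - 1) // 13, group_keys=False)['Existing column'].apply(lambda x: (x.shift(4) + x.shift(8) + x.shift(12)) / 3)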
Here, rows are partitioned into groups of 13 rows by grouping them under different group numbers set by (df['Row number'] - 1) // 13
Then within each group, we use .apply() on the column Existing column and use .shift() to get the previous 4th, 8th and 12th entries within the group.
Test Run
import numpy as np
import pandas as pd
data = {'Row number' : np.arange(1, 40), 'Existing column': np.arange(11, 50) }
df = pd.DataFrame(data)
print(df)
Row number Existing column
0 1 11
1 2 12
2 3 13
3 4 14
4 5 15
5 6 16
6 7 17
7 8 18
8 9 19
9 10 20
10 11 21
11 12 22
12 13 23
13 14 24
14 15 25
15 16 26
16 17 27
17 18 28
18 19 29
19 20 30
20 21 31
21 22 32
22 23 33
23 24 34
24 25 35
25 26 36
26 27 37
27 28 38
28 29 39
29 30 40
30 31 41
31 32 42
32 33 43
33 34 44
34 35 45
35 36 46
36 37 47
37 38 48
38 39 49
df['New column'] = df.groupby((df['Row number'] - 1) // 13, group_keys=False)['Existing column'].apply(lambda x: (x.shift(4) + x.shift(8) + x.shift(12)) / 3)
print(df)
Row number Existing column New column
0 1 11 NaN
1 2 12 NaN
2 3 13 NaN
3 4 14 NaN
4 5 15 NaN
5 6 16 NaN
6 7 17 NaN
7 8 18 NaN
8 9 19 NaN
9 10 20 NaN
10 11 21 NaN
11 12 22 NaN
12 13 23 15.0
13 14 24 NaN
14 15 25 NaN
15 16 26 NaN
16 17 27 NaN
17 18 28 NaN
18 19 29 NaN
19 20 30 NaN
20 21 31 NaN
21 22 32 NaN
22 23 33 NaN
23 24 34 NaN
24 25 35 NaN
25 26 36 28.0
26 27 37 NaN
27 28 38 NaN
28 29 39 NaN
29 30 40 NaN
30 31 41 NaN
31 32 42 NaN
32 33 43 NaN
33 34 44 NaN
34 35 45 NaN
35 36 46 NaN
36 37 47 NaN
37 38 48 NaN
38 39 49 41.0
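If the goal is a value for every row from the 13th onward, rather than one value per block of 13, the same .shift() idea also works without the grouping step. This is a minimal sketch on the 13 sample values from the question. It assumes the rows are already in order and the index has no gaps.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Row number": range(1, 14),
        "Existing column": [2, 4, 3, 1, 6, 4, 8, 2, 4, 9, 2, 4, 3],
    }
)

s = df["Existing column"]

# shift(n) moves each value down n rows, so on any given row
# s.shift(4) holds the value from 4 rows earlier (NaN if there is none).
df["New column"] = (s.shift(4) + s.shift(8) + s.shift(12)) / 3
print(df)
```

Any row that lacks one of the three earlier rows gets NaN, because NaN propagates through the addition.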
You can use rolling with .apply to apply a custom aggregation function.
Note that the average of (4, 6, 2) is 4, not 3 as shown in the question's table:
>>> (2 + 6 + 4) / 3
4.0
>>> df["New column"] = df["Existing column"].rolling(13).apply(lambda x: x.iloc[[0, 4, 8]].mean())
>>> df
Row number Existing column New column
0 1 2 NaN
1 2 4 NaN
2 3 3 NaN
3 4 1 NaN
4 5 6 NaN
5 6 4 NaN
6 7 8 NaN
7 8 2 NaN
8 9 4 NaN
9 10 9 NaN
10 11 2 NaN
11 12 4 NaN
12 13 3 4.0
Breaking it down:
df["Existing column"]: select "Existing column" from the dataframe.
.rolling(13): starting with the first 13 rows, we're going to move a sliding window across all of the data. So first we will encounter rows 0-12, then rows 1-13, then 2-14, and so on.
.apply(...): for each of those rolling sections, we're going to apply a function that works on that section (in this case, the function we're applying is the lambda).
lambda x: x.iloc[[0, 4, 8]].mean(): from each of those rolling sections, extract the 0th, 4th, and 8th values (corresponding to rows 1, 5, and 9 in the first window), then calculate and return their mean.
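The rolling steps above can be sketched as a self-contained script. The variant below passes raw=True, so each window arrives as a NumPy array rather than a Series. That is an optional choice made for this sketch, not something the answer requires. The data are the 13 sample values from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Row number": range(1, 14),
        "Existing column": [2, 4, 3, 1, 6, 4, 8, 2, 4, 9, 2, 4, 3],
    }
)

# Each window holds 13 consecutive values and ends at the current row.
# Positions 0, 4 and 8 inside the window are therefore the values
# 12, 8 and 4 rows before the current row.
df["New column"] = (
    df["Existing column"]
    .rolling(13)
    .apply(lambda w: w[[0, 4, 8]].mean(), raw=True)
)
print(df)
```

The first 12 rows come out as NaN because a fixed-size window requires a full 13 observations by default.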
In order to work on your dataframe in chunks (or groups) instead of a sliding window, you can apply the same logic with the .groupby method (instead of .rolling).
>>> groups = np.arange(len(df)) // 13 # defines groups as chunks of 13 rows
>>> averages = (
df.groupby(groups)["Existing column"]
.apply(lambda x: x.iloc[[0, 4, 8]].mean())
)
>>> averages.index = (averages.index + 1) * 13 - 1
>>> df["New column"] = averages
>>> df
Row number Existing column New column
0 1 2 NaN
1 2 4 NaN
2 3 3 NaN
3 4 1 NaN
4 5 6 NaN
5 6 4 NaN
6 7 8 NaN
7 8 2 NaN
8 9 4 NaN
9 10 9 NaN
10 11 2 NaN
11 12 4 NaN
12 13 3 4.0
Breaking it down now:
groups = np.arange(len(df)) // 13: creates an array that will be used to chunk our dataframe into groups. This array will essentially be 13 0s, followed by 13 1s, followed by 13 2s... until the array is the same length as the dataframe. So in this case, for a single-chunk example, it will only be an array of 13 0s.
df.groupby(groups)["Existing column"]: groups the dataframe according to the groups defined above and selects the "Existing column".
.apply(lambda x: x.iloc[[0, 4, 8]].mean()): conceptually the same as before, except we're applying it to each group instead of a sliding window.
averages.index = (averages.index + 1) * 13 - 1: this part may seem a little odd, but we're essentially ensuring that our selected averages line up with the original dataset correctly. In this case, we want the average from group 0 (specified with an index value of 0 in the averages Series) to align to row 12, since (0 + 1) * 13 - 1 = 12. If we had another group (group 1), we would want it to align to row 25 in the original dataset, since (1 + 1) * 13 - 1 = 25. So we can use a little math to do this transformation.
df["New column"] = averages: since we already matched up our indices, pandas takes care of the actual alignment of these new values under the hood for us.
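To see the index arithmetic with more than one chunk, here is a minimal sketch on made-up data: 26 rows holding the values 0 to 25, which is an assumption chosen only so the averages are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two full chunks of 13 rows.
df = pd.DataFrame({"Existing column": np.arange(26)})

groups = np.arange(len(df)) // 13  # 13 zeros followed by 13 ones

averages = (
    df.groupby(groups)["Existing column"]
    .apply(lambda x: x.iloc[[0, 4, 8]].mean())
)

# Map group number g to the last row of its chunk: (g + 1) * 13 - 1.
averages.index = (averages.index + 1) * 13 - 1

# Assignment aligns on the index, so every other row becomes NaN.
df["New column"] = averages
print(df)
```

Group 0 averages rows 0, 4 and 8 and lands on row 12. Group 1 averages rows 13, 17 and 21 and lands on row 25.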
I have an existing dataframe like this
>>> print(dataframe)
sid
30 11
56 5
73 25
78 2
132 1
..
8531 25
8616 2
9049 1
9125 6
9316 11
Name: name, Length: 87, dtype: int64
I want to add a row like {'sid': 2, '': 100} to it but when I try this
df = pandas.DataFrame({'sid': [2], '': [100]})
df = df.set_index('sid')
dataframe = dataframe.append(df)
print(dataframe)
I end up with
sid
30 11.0 NaN
56 5.0 NaN
73 25.0 NaN
78 2.0 NaN
132 1.0 NaN
... ... ...
8616 2.0 NaN
9049 1.0 NaN
9125 6.0 NaN
9316 11.0 NaN
2 NaN 100.0
I'm hoping for something more like
sid
2 100
30 11
56 5
73 25
78 2
132 1
..
8531 25
8616 2
9049 1
9125 6
9316 11
Any idea how I can achieve that?
The way to do this was
dataframe.loc[2] = 100
Thanks anky!
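A minimal sketch of that fix on a small stand-in for the data. The footer "Name: name, ... dtype: int64" in the question suggests the object is actually a Series, so a Series is assumed here. Assigning through .loc adds the new label at the end, and sort_index() is an extra step, not part of the original fix, that moves it to the front as in the desired output.

```python
import pandas as pd

# Stand-in for the question's data: a Series named 'name' indexed by 'sid'.
s = pd.Series(
    [11, 5, 25],
    index=pd.Index([30, 56, 73], name="sid"),
    name="name",
)

# Assigning to a label that does not exist yet appends a new row.
s.loc[2] = 100

# The new row is added last; sorting by index puts sid 2 first.
s = s.sort_index()
print(s)
```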
The reason for the problem above is that, when you appended the two DataFrames, 'sid' had not been set as the index on both of them. The two DataFrames therefore had different structures when you appended them. Make sure both dataframes have the same index and columns before you append them.
import pandas as pd
data = [ [30,11], [56, 5], [73, 25]] #test dataframe
dataframe = pd.DataFrame(data, columns=['sid', ''])
dataframe = dataframe.set_index('sid')
print(dataframe)
You get,
sid
30 11
56 5
73 25
Create and set the index of df,
df = pd.DataFrame({'sid' : [2], '' : [100]})
df = df.set_index('sid')
You get,
sid
2 100
Then concatenate them (DataFrame.append was removed in pandas 2.0, so pd.concat is used here),
dataframe = pd.concat([df, dataframe])
print(dataframe)
You will get the desired outcome,
sid
2 100
30 11
56 5
73 25
We have data from a device that measures multiple parts, and it outputs multiple measurements for each part into a CSV file. We read the CSV file into a dataframe with a structure such as this:
PartNo 12
Meas1 45
Meas2 23
!END
PartNo 13
Meas1 63
Meas2 73
!END
PartNo 12
Meas1 82
Meas2 84
!END
The "!END" flag indicates where the data from one part ends, and the next part starts.
We would like to reshape the data so it looks like:
PartNo Meas1 Meas2
12 45 23
13 63 73
12 82 84
(Note that a part could appear more than once - so there is no field that is guaranteed to be unique across all records.)
A pivot produces:
0 !END Meas1 Meas2 PartNo
0 NaN NaN NaN 12.0
1 NaN 45.0 NaN NaN
2 NaN NaN 23.0 NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN 13.0
5 NaN 63.0 NaN NaN
6 NaN NaN 73.0 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN 12.0
9 NaN 82.0 NaN NaN
10 NaN NaN 84.0 NaN
11 NaN NaN NaN NaN
How do I squeeze these rows down to group by PartNo?
A transpose produces:
0 1 2 3 4 5 6 7 8 9 10 11
0 PartNo Meas1 Meas2 !END PartNo Meas1 Meas2 !END PartNo Meas1 Meas2 !END
1 12 45 23 NaN 13 63 73 NaN 12 82 84 NaN
How could I reset the row every 4th item?
I could create a new index column in the original dataframe, and then iterate through the rows incrementing the index for every row with !END (and then use the index to group the data),
but it seems that there ought to be a more elegant shape shifting function to handle this case, or maybe there is an argument to Pivot or Transpose that would handle this. I am a Python beginner.
Here is the full code:
import pandas as pd
from io import StringIO
tdata = (
'PartNo, 12\n'
'Meas1, 45\n'
'Meas2, 23\n'
'!END\n'
'PartNo, 13\n'
'Meas1, 63\n'
'Meas2, 73\n'
'!END\n'
'PartNo, 12\n'
'Meas1, 82\n'
'Meas2, 84\n'
'!END\n')
tdf = pd.read_csv(StringIO(tdata), header=None)
print(tdf)
print(tdf.pivot(index=None, columns=0, values=1))
print(tdf.T)
#having dataframe x:
>>> x = pd.DataFrame([['PartNo',12],['Meas1',45],['Meas2',23],['!END',''],['PartNo',13],['Meas1',63],['Meas2',73],['!END',''],['PartNo',12],['Meas1',82],['Meas2',84],['!END','']])
>>> x
0 1
0 PartNo 12
1 Meas1 45
2 Meas2 23
3 !END
4 PartNo 13
5 Meas1 63
6 Meas2 73
7 !END
8 PartNo 12
9 Meas1 82
10 Meas2 84
11 !END
#Group by the first column and aggregate the values into lists. Column 1 of the result is then a Series of lists; converting each list to a Series yields a dataframe, which you then just need to transpose.
>>> df = x.groupby(0).agg(lambda x: list(x))[1].apply(lambda x: pd.Series(x)).transpose()
>>> df[['PartNo','Meas1','Meas2']]
0 PartNo Meas1 Meas2
0 12 45 23
1 13 63 73
2 12 82 84
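The idea from the question, numbering records by counting the !END rows, can also be done without an explicit loop. This is a minimal sketch, assuming the comma-separated sample from the question. The column names key, value and record are hypothetical labels introduced for this example.

```python
import pandas as pd
from io import StringIO

tdata = (
    "PartNo, 12\n" "Meas1, 45\n" "Meas2, 23\n" "!END\n"
    "PartNo, 13\n" "Meas1, 63\n" "Meas2, 73\n" "!END\n"
    "PartNo, 12\n" "Meas1, 82\n" "Meas2, 84\n" "!END\n"
)

tdf = pd.read_csv(
    StringIO(tdata), header=None, names=["key", "value"], skipinitialspace=True
)

# A running count of the '!END' markers gives every row a record number.
is_end = tdf["key"].eq("!END")
tdf["record"] = is_end.cumsum()

# Drop the marker rows, then spread the keys out into columns.
wide = (
    tdf[~is_end]
    .pivot(index="record", columns="key", values="value")[["PartNo", "Meas1", "Meas2"]]
    .astype(int)
    .reset_index(drop=True)
)
wide.columns.name = None
print(wide)
```

Because the record number comes from the markers rather than from PartNo, a part that appears more than once still gets one row per measurement block.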
The file is not a CSV file, so parsing it with the csv module cannot produce correct output. It is not a well-known format, so I would use a custom parser:
import pandas as pd

with open(filename) as fd:
    data = []
    row = None
    for line in fd:
        line = line.strip()
        if line == '!END':
            row = None  # the current record is complete
        else:
            k, v = line.split(None, 1)
            if row is None:
                row = {k: v}  # the first field starts a new record
                data.append(row)
            else:
                row[k] = v
# collect the field names in first-seen order (a set would lose the order)
header = list(dict.fromkeys(k for row in data for k in row))
df = pd.DataFrame(data, columns=header)
Based on the information provided, I think you should be able to achieve what you want using this approach:
df = df[df[0] != '!END']
out = df.groupby(0).agg(list).T.apply(lambda x: x.explode(), axis=0)
output:
0 Meas1 Meas2 PartNo
1 45 23 12
1 63 73 13
1 82 84 12
This essentially groups the original df by the PartNo, Meas1 and Meas2 keys and makes a list for each. It then explodes each list into a pd.Series, making one column per key, with the number of rows equal to the number of entries for each key (which should be the same for all keys).
Here is how I would do it. I would parse the file as any text file and then create a record from the fields I need. I would use the '!END' row as the signal that a record is complete, append the record to a list, and finally convert the list to a DataFrame.
import pandas as pd

filename = 'PartDetail.csv'
with open(filename, 'r') as file:
    LinesFromFile = file.readlines()

RowToWrite = []
for EachLine in LinesFromFile:
    ValuePosition = EachLine.find(" ") + 1  # index just after the first space
    # text before the first space, without a trailing comma if the file has one
    CurrentAttrib = EachLine[0:ValuePosition - 1].rstrip(',')
    if CurrentAttrib == 'PartNo':
        PartNo = EachLine[ValuePosition:].strip()
    if CurrentAttrib == 'Meas1':
        Meas1 = EachLine[ValuePosition:].strip()
    if CurrentAttrib == 'Meas2':
        Meas2 = EachLine[ValuePosition:].strip()
    if EachLine[0:4] == '!END':
        RowToWrite.append([PartNo, Meas1, Meas2])  # the record is complete
PartsDataDF = pd.DataFrame(RowToWrite, columns=['PartNo', 'Meas1', 'Meas2'])  # Converting to DataFrame
This will give you a cleaner DataFrame with the columns PartNo, Meas1 and Meas2.
Hope it helps.
I have a positional text file that has the related data split into two lines.
Column 1Column 2Column 3
Text
11 12 13
text for 1
21 22 23
text for 2
31 32 33
text for 3
41 42 43
text for 4
51 52 53
text for 5
I'm trying to get this into a dataframe like
Column 1Column 2Column 3 Text
11 12 13 text for 1
21 22 23 text for 2
31 32 33 text for 3
41 42 43 text for 4
51 52 53 text for 5
I'm testing without the column headers
import pandas as pd
cols=([(0,8),(8,16),(16,None),(0,50)])
rs=pd.read_fwf(fn,colspecs=cols,header=None)
gives me:
0 1 2 3
0 11 12 13.0 11 12 13
1 text for 1 NaN text for 1
2 21 22 23.0 21 22 23
3 text for 2 NaN text for 2
Is there any way to alternate the formats of the lines?
You can try to get every other row, and join the Text into one string as a new Text column, like this:
data = df.values.tolist()[::2][1:]
df = df[1:]
df = pd.DataFrame(df.values.tolist()[::2], columns=df.columns)
df['Text'] = [' '.join([str(x) for x in i[:-1]]) for i in data]
df = df.drop('3', axis=1)
print(df)
Result:
Column 1Column 2Column Text
0 11 12 13.0 text for 1.0
1 21 22 23.0 text for 2.0
2 31 32 33.0 text for 3.0
3 41 42 43.0 text for 4.0
4 51 52 53.0 text for 5.0
The first line builds a list (data) from the values of every other row in df, dropping the first of them. The second line removes the first row from df, and the third line builds a new dataframe from every other remaining row. The fourth line creates the Text column by joining the values in data with a list comprehension, the fifth line drops the extra column named '3', and the sixth line prints the result.
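An alternative that avoids read_fwf altogether is to read the raw lines and pair each numeric line with the text line that follows it. This is a minimal sketch under stated assumptions: the file starts with two header lines, the numeric fields are 8 characters wide (as in the question's colspecs), and the in-memory string raw stands in for the real file.

```python
import pandas as pd
from io import StringIO

# Stand-in for the file contents (only two records shown).
raw = (
    "Column 1Column 2Column 3\n"
    "Text\n"
    "11      12      13\n"
    "text for 1\n"
    "21      22      23\n"
    "text for 2\n"
)

lines = StringIO(raw).read().splitlines()
body = lines[2:]  # skip the two header lines

records = []
# body[0::2] are the numeric lines, body[1::2] the matching text lines.
for nums, text in zip(body[0::2], body[1::2]):
    records.append(
        {
            "Column 1": int(nums[0:8]),
            "Column 2": int(nums[8:16]),
            "Column 3": int(nums[16:]),
            "Text": text.strip(),
        }
    )

df = pd.DataFrame(records)
print(df)
```

Slicing each line by position keeps the numeric columns as integers and leaves the text untouched, so no float conversion creeps into the Text column.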
I have a dataframe that looks like the one below. What I'd like to do is create another column that is based on the VALUE of the index (so any row whose index is less than 10 would be labeled "small" in the new column). I can do something like lengthDF[lengthDF.index < 10] to get the rows I want, but I'm not sure how to get the additional column. I've tried the approach from "Create Column with ELIF in Pandas" but can't get it to read the index...
LengthFirst LengthOthers
0 1 NaN
4 NaN 1
9 NaN 1
13 NaN 1
17 1 1
18 NaN 1
19 NaN 1
20 1 NaN
21 1 1
22 3 4
23 1 NaN
24 7 6
25 1 2
26 16 19
27 1 2
28 24 8
29 9 12
30 73 65
31 15 12
32 55 60
33 28 21
34 29 31
Something like this?
lengthDF['size'] = 'large'
lengthDF.loc[lengthDF.index < 10, 'size'] = 'small'
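The same labeling can be done in one step with numpy.where, which picks between two values based on a boolean condition. This is a minimal sketch on the first few rows of the question's data. The threshold of 10 and the labels come from the question, and the column name size from the answer above.

```python
import numpy as np
import pandas as pd

# A few rows from the question; the index carries the value of interest.
lengthDF = pd.DataFrame(
    {
        "LengthFirst": [1, np.nan, np.nan, np.nan, 1],
        "LengthOthers": [np.nan, 1, 1, 1, 1],
    },
    index=[0, 4, 9, 13, 17],
)

# The comparison runs on the index itself, so no extra column is needed.
lengthDF["size"] = np.where(lengthDF.index < 10, "small", "large")
print(lengthDF)
```

For more than two labels, pandas.cut applied to the index is a common alternative.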