Pandas jumbling up data after selecting columns - python

I have a (2.3m x 33) size dataframe. As I always do when selecting columns to keep, I use
colsToKeep = ['A','B','C','D','E','F','G','H','I']
df = df[colsToKeep]
However, this time the data under these columns becomes completely jumbled up when the code runs. Entries that belong under column A might end up under column D, for example, seemingly at random.
Has anybody experienced this kind of behavior before? There is nothing out of the ordinary about the data, and the df is totally fine before running these lines. Here is the code run before the problem begins:
import pandas as pd

with open('file.dat','r') as f:
    df = pd.DataFrame(l.rstrip().split() for l in f)
#rename columns with the first row
df.columns = df.iloc[0]
#drop first row, which is now duplicated in the header
df = df.iloc[1:]
#remove the 33 NaN columns that appeared
df = df.loc[:,df.columns.notnull()]
colsToKeep = ['A','B','C','D','E','F','G','H','I']
df = df[colsToKeep]
The data suddenly goes from being nicely formatted, such as:
A B C D E F G H I
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
to something more random like:
A B C D E F G H I
7 9 3 4 5 1 2 8 6
3 2 9 2 1 6 7 8 4
2 1 3 6 5 4 7 9 8
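Not part of the original post, but two things worth ruling out before the column selection are ragged rows (lines that split into a different number of fields, which would shift values sideways) and duplicate labels in the header row. A minimal diagnostic sketch, assuming the code above has already run:
#do all raw lines split into the same number of fields?
with open('file.dat','r') as f:
    widths = {len(l.rstrip().split()) for l in f}
print(widths)
#did the header row produce duplicate column labels?
print(df.columns[df.columns.duplicated()].tolist())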

Related

Add all column values repeated of one data frame to other in pandas

I have two data frames:
df1 = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
a b
0 1 4
1 2 5
2 3 6
df2 = pd.DataFrame({'c':[7],'d':[8]})
c d
0 7 8
The goal is to add all of df2's column values to df1, repeated for every row of df1, to create the following result. It is assumed that the two data frames do not share any column names.
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If the column names are strings, you can use DataFrame.assign and unpack the Series created by selecting the first row of df2:
df = df1.assign(**df2.iloc[0])
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
Another idea is to repeat the values over df1.index with DataFrame.reindex and then use DataFrame.join (here the first index value of df2 is the same as the first index value of df1.index):
df = df1.join(df2.reindex(df1.index, method='ffill'))
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If there are no missing values in the original df, you can forward fill the missing values in the last step, but the joined columns are also cast to floats (thanks @Dishin H Goyan):
df = df1.join(df2).ffill()
print (df)
a b c d
0 1 4 7.0 8.0
1 2 5 7.0 8.0
2 3 6 7.0 8.0
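Not in the original answers, but if your pandas is 1.2 or newer, a cross join gives the same repetition while keeping the integer dtypes:
df = pd.merge(df1, df2, how='cross')
print (df)
   a  b  c  d
0  1  4  7  8
1  2  5  7  8
2  3  6  7  8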

Python: dataframe pivot with duplicate labels

I have a pandas dataframe as below:
index ColumnName ColumnValue
0 A 1
1 B 2
2 C 3
3 A 4
4 B 5
5 C 6
6 A 7
7 B 8
8 C 9
I want output like below as a pandas dataframe:
A B C
1 2 3
4 5 6
7 8 9
Can anyone suggest how I can achieve the desired output?
Regards,
Vipul
The first solution that came to my mind is to use a for loop over the unique ColumnName values, as below. If you want to achieve it with a pivot instead, see the sketch after the output below.
columns = df['ColumnName'].unique()
data = {}
for column in columns:
    data[column] = list(df[df['ColumnName'] == column]['ColumnValue'])
pd.DataFrame(data)
which will give you the below output
A B C
0 1 2 3
1 4 5 6
2 7 8 9
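If you do want a pivot-based version, here is a sketch (not from the original answer): number the rows within each ColumnName group with groupby().cumcount() and pivot on that counter.
out = (df.assign(row=df.groupby('ColumnName').cumcount())
         .pivot(index='row', columns='ColumnName', values='ColumnValue')
         .rename_axis(index=None, columns=None))
print (out)
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9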

Move dataframe column below the other (pandas Python)

I want to shift a specific column down by one (I don't know if another library can help me).
import pandas as pd
#pd.set_option('display.max_rows',100)
fac=pd.read_excel('TEST.xlsm',sheet_name="DC - Consumables",header=None, skiprows=1)
df = pd.DataFrame(fac)
df1=df.iloc[0:864,20:39]
df2=df.iloc[0:864,40:59]
df1=pd.concat([df1,df2])
print (df1)
I want one block of columns (the second dataframe) to sit below the other. In the sheet the two blocks look like this, side by side:
A B C   A B C
1 2 3   6 7 8
4 5 8   4 1 9
My code prints this:
A B C
1 2 3
4 5 8
A B C
6 7 8
4 1 9
I need the second dataframe to be below the first one, like this:
A B C
1 2 3
4 5 8
A B C
6 7 8
4 1 9
Please help me
Try pd.concat().
df3 = pd.concat([df1, df2])
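Note that the two iloc slices carry different column labels (columns 20:39 and 40:59 of the sheet), so a plain concat keeps both sets of labels and fills the gaps with NaN instead of stacking the values under one set of columns. A sketch that aligns the labels first, assuming both slices have the same number of columns:
df2.columns = df1.columns
df3 = pd.concat([df1, df2], ignore_index=True)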

Pandas Data Frame to_csv with more separator

I have a file of 40 columns and 600,000 rows. After processing it in a pandas dataframe, I would like to save the dataframe to csv with a different spacing length for each column. There is a sep kwarg in df.to_csv; I tried it with a regex, but I'm getting the error
TypeError: "delimiter" must be an 1-character string.
I want the output with different column spacing, as shown below
A B C D E F G
1 3 5 8 8 9 8
1 3 5 8 8 9 8
1 3 5 8 8 9 8
1 3 5 8 8 9 8
1 3 5 8 8 9 8
Using the code below I'm getting tab-delimited output, where every column has the same spacing.
df.to_csv("D:\\test.txt", sep = "\t", encoding='utf-8')
A B C D E F G
1 3 5 8 8 9 8
1 3 5 8 8 9 8
1 3 5 8 8 9 8
1 3 5 8 8 9 8
1 3 5 8 8 9 8
I don't want to loop; it might take a lot of time for 600k lines.
Thank you for the comments, they helped me.
Below is the code.
import pandas as pd
#Create DataFrame
df = pd.DataFrame({'A':[0,1,2,3],'B':[0,11,2,333],'C':[0,1,22,3],'D':[00,1,2,33]})
#Convert the Columns to string
df[df.columns]=df[df.columns].astype(str)
#Create the list of column separator width
SepWidth = [5,6,3,8]
#Temp dict to hold the padded columns
tempdf = {}
#Pad each column to its separator width
for i, eCol in enumerate(df):
    tempdf[i] = pd.Series(df[eCol]).str.pad(width=SepWidth[i])
#Final DataFrame
Fdf = pd.concat(tempdf, axis=1)
#print Fdf
#Export to csv
Fdf.to_csv("D:\\test.txt", sep='\t', index=False, header=False, encoding='utf-8')
output of test.txt
0 0 0 0
1 11 1 1
2 2 22 2
3 333 3 33
UPDATE
A tab delimiter ('\t') was still included in the spacing when using pandas.to_csv. Instead of pandas.to_csv, I'm using the code below to save as txt.
numpy.savetxt(file, df.values, fmt='%s')
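A minimal sketch of that approach, assuming Fdf is the padded frame built above (the output path is just an example):
import numpy as np
#dump the already padded strings; savetxt joins fields with a single space
np.savetxt("D:\\test.txt", Fdf.values, fmt='%s')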

Deleting multiple DataFrame columns in Pandas

Say I want to delete a set of adjacent columns in a DataFrame and my code looks something like this currently:
del df['1'], df['2'], df['3'], df['4'], df['5'], df['6']
This works, but I was wondering if there was a more efficient, compact, or aesthetically pleasing way to do it, such as:
del df['1','6']
I think you need drop; for the selection you can use range or numpy.arange:
import numpy as np
import pandas as pd

df = pd.DataFrame({'1':[1,2,3],
                   '2':[4,5,6],
                   '3':[7,8,9],
                   '4':[1,3,5],
                   '5':[7,8,9],
                   '6':[1,3,5],
                   '7':[5,3,6],
                   '8':[5,3,6],
                   '9':[7,4,3]})
print (df)
1 2 3 4 5 6 7 8 9
0 1 4 7 1 7 1 5 5 7
1 2 5 8 3 8 3 3 3 4
2 3 6 9 5 9 5 6 6 3
print (np.arange(1,7))
[1 2 3 4 5 6]
print (range(1,7))
range(1, 7)
#convert string column names to int
df.columns = df.columns.astype(int)
df = df.drop(np.arange(1,7), axis=1)
#another solution with range
#df = df.drop(range(1,7), axis=1)
print (df)
7 8 9
0 5 5 7
1 3 3 4
2 6 6 3
You can do this without modifying the columns, by passing a slice object to drop:
In [29]:
df.drop(df.columns[slice(df.columns.tolist().index('1'),df.columns.tolist().index('6')+1)], axis=1)
Out[29]:
7 8 9
0 5 5 7
1 3 3 4
2 6 6 3
So this finds the ordinal positions of the lower and upper bounds of the column end points and uses them to create a slice object against the columns array.
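Not in the original answer, but the same idea can be written with a label slice through .loc, assuming the column labels are still the original strings and appear in order:
df.drop(columns=df.loc[:, '1':'6'].columns)
   7  8  9
0  5  5  7
1  3  3  4
2  6  6  3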
