Select a column in dataframe from csv - python

I am trying to select the 'Name' column from a sample csv file named gradesM3.csv.
I have been following this tutorial but when it comes to selecting a single column, it doesn't work anymore.
My code:
import pandas as pd
df = pd.read_csv('gradesM3.csv')
df
The output:
Out[9]:
StudentID;Name;Assignment1;Assignment2;Assignment3
0 s123456;Michael Andersen;11;7;-3
1 s123789;Bettina Petersen;0;4;10
2 s123579;Marie Hansen;10;4;7
I believe there's already something wrong here, because from what I've seen in other discussions the output is supposed to look more like a table.
When I try to display only the 'Name' column, with this command:
df['Name']
It returns:
KeyError: 'Name'
To sum up, I am trying to import my CSV file as a proper dataframe so I can work with it.
Thanks

SOLVED
Thanks to W-B's comment, it worked with this code:
df = pd.read_csv('gradesM3.csv',sep=';')
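For reference, a minimal sketch of the full flow, assuming the file really is semicolon-delimited like the sample output above:
import pandas as pd

# the sample file is semicolon-delimited, so tell pandas which separator to use
df = pd.read_csv('gradesM3.csv', sep=';')

# each field is now its own column, so 'Name' can be selected directly
print(df['Name'])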

Related

How to drop the index after creating the csv file in pandas

I am trying to select a couple of columns based on the column heading (using a wildcard), plus one more column. When I execute the code below I get the expected result, but an index column appears in the output. How do I drop the index? Any suggestions?
infile:
dir,name,ct1,cn1,ct2,cn2
991,name1,em,a#email.com,ep,1234
999,name2,em,b#email.com,ep,12345
872,name3,em,c#email.com,ep,123456
here is the code which I used.
import pandas as pd
df=pd.read_csv('infile.csv')
df_new=df.loc[:,df.columns.str.startswith('c')]
df_new_1=pd.read_csv('name.csv', usecols= ['dir'])
df_merge=pd.concat([df_new,df_new_1],axis=1, join="inner")
df_merge.to_csv('outfile.csv')
Pass index=False when you save to csv:
df_merge.to_csv('outfile.csv', index=False)
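As a rough illustration of the difference, using a tiny throwaway frame with made-up values:
import pandas as pd

# small frame just to show what the index parameter changes
df = pd.DataFrame({'ct1': ['em', 'em'], 'cn1': ['a#email.com', 'b#email.com']})

df.to_csv('with_index.csv')                   # writes an extra unnamed index column (0, 1, ...)
df.to_csv('without_index.csv', index=False)   # writes only the data columns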

Pandas dataframe

I want to import an excel where I want to keep just some columns.
This is my code:
df=pd.read_excel(file_location_PDD)
col=df[['hkont','dmbtr','belnr','monat','gjahr','budat','shkzg','shkzg','usname','sname','dmsol','dmhab']]
print(col)
col.to_excel("JETNEW.xlsx")
I selected all the columns I want, but two of the column names, 'usname' and 'sname', don't always appear in the files I have to import.
Because of that I get the error ['usname', 'sname'] not in index.
How can I do this ?
Thanks
Source -- https://stackoverflow.com/a/38463068/14515824
You need to use df.reindex instead of df[[]]. I also changed 'excel.xlsx' to r'excel.xlsx' (a raw string) so that backslashes in the file path are not treated as escape sequences.
An example:
df.reindex(columns=['a','b','c'])
Which in your code would be:
file_location_PDD = r'excel.xlsx'
df = pd.read_excel(file_location_PDD)
col = df.reindex(columns=['hkont','dmbtr','belnr','monat','gjahr','budat','shkzg','shkzg','usname','sname','dmsol','dmhab'])
print(col)
col.to_excel("output.xlsx")

How to store the string of the column in excel using python

I have attached a screenshot of my Excel sheet. I want to store the length of every string from the SUPPLIER_ID column in the SUPPLIER_ID LENGTH column, but when I run my code the columns in the CSV come out blank.
When I use the same code on a different CSV, it works fine.
I am using the following code but I am not able to print the data.
I have attached a snippet of the csv. Can somebody tell me why this is happening:
import pandas as pd
data = pd.read_csv(r'C:/Users/patesari/Desktop/python work/nba.csv')
df = pd.DataFrame(data, columns= ['SUPPLIER_ID','ACTION'])
data.dropna(inplace = True)
data['SUPPLIER_ID']= data['SUPPLIER_ID'].astype(str)
data['SUPPLIER_ID LENGTH']= data['SUPPLIER_ID'].str.len()
data['SUPPLIER_ID']= data['SUPPLIER_ID'].astype(float)
data
print(df)
data.to_csv("C:/Users/patesari/Desktop/python work/nba.csv")
I faced a similar problem in the past.
Instead of:
df = pd.DataFrame(data, columns= ['SUPPLIER_ID','ACTION'])
Type this:
data.columns=['SUPPLIER_ID','ACTION']
Also, I don't understand why you created the DataFrame df. It seems unnecessary to me.
Aren't you getting a SettingWithCopyWarning from pandas? I would imagine (I haven't run this code) that these lines
data['SUPPLIER_ID']= data['SUPPLIER_ID'].astype(str)
data['SUPPLIER_ID LENGTH']= data['SUPPLIER_ID'].str.len()
data['SUPPLIER_ID']= data['SUPPLIER_ID'].astype(float)
would not do anything, and should be replaced with
data.loc[:, 'SUPPLIER_ID']= data['SUPPLIER_ID'].astype(str)
data.loc[:, 'SUPPLIER_ID LENGTH']= data['SUPPLIER_ID'].str.len()
data.loc[:, 'SUPPLIER_ID']= data['SUPPLIER_ID'].astype(float)
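For what it's worth, here is a self-contained sketch of the length computation with made-up values, working directly on a freshly constructed frame so there is no copy of a slice involved:
import pandas as pd

# made-up values standing in for nba.csv
data = pd.DataFrame({'SUPPLIER_ID': [12345.0, 678.0], 'ACTION': ['add', 'drop']})

data['SUPPLIER_ID'] = data['SUPPLIER_ID'].astype(str)          # make the IDs strings
data['SUPPLIER_ID LENGTH'] = data['SUPPLIER_ID'].str.len()     # measure each string
data['SUPPLIER_ID'] = data['SUPPLIER_ID'].astype(float)        # convert back to numbers
print(data)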

How to use 'loc' for column selection of a dataframe in dask

Can anyone tell me how I should select one column with 'loc' in a dataframe using dask?
As a side note, when I load the dataframe using dd.read_csv with header set to None, the column names run from 0 up to 131094. I am trying to select the last column, whose name is 131094, and I get the error below.
code:
import dask.dataframe as dd
df = dd.read_csv('filename.csv', header=None)
y = df.loc['131094']
error:
File "/usr/local/dask-2018-08-22/lib/python2.7/site-packages/dask-0.5.0-py2.7.egg/dask/dataframe/core.py", line 180, in _loc
"Can not use loc on DataFrame without known divisions")
ValueError: Can not use loc on DataFrame without known divisions
Based on this guideline, http://dask.pydata.org/en/latest/dataframe-indexing.html#positional-indexing, my code should work, but I don't know what is causing the problem.
If you have a named column, then use: df.loc[:,'col_name']
But if you have a positional column, like in your example where you want the last column, then use the answer by #user1717828.
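A minimal sketch of the named-column case, assuming the CSV has a header row and a column literally called 'col_name':
import dask.dataframe as dd

df = dd.read_csv('filename.csv')   # the header row supplies the column names
y = df.loc[:, 'col_name']          # select a single column by label
print(y.compute())                 # materialize the result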
I tried this on a dummy csv and it worked. I can't help you for sure without seeing the file that is giving you problems. That said, you might be picking rows, not columns.
Instead, try this.
import dask.dataframe as dd
df = dd.read_csv('filename.csv', header=None)
y = df[df.columns[-1]]

How to label columns by reading in from a file in pandas

My real issue is that I can't seem to use the header names after I assign them, but I think it's caused by labeling the headers wrong.
My code is as follows:
import pandas as pd
dataFrame1 = pd.read_csv('C:/Users/Desktop/data/data/featurenames.txt', header=None, encoding='utf-8')
dataFrame2 = pd.read_csv('C:/Users/Desktop/data/data/DataSet.txt')
dataFrame2.columns=[dataFrame1]
The result is the following:
If I use print (dataFrame2)
I get a result where the headers are wrapped in brackets for some reason.
But if I use print (dataFrame2['id'])
I get - KeyError: 'id'
Can anyone help me with this?
Look at dataFrame2.columns; there you will see what the column names actually are.
You could use
dataFrame2 = pd.read_csv('C:/Users/Desktop/data/data/DataSet.txt',header=None,names=dataFrame1)
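Note that dataFrame1 is itself a DataFrame, so if featurenames.txt holds one header name per line (an assumption here), you probably want to pass that single column as a plain list:
import pandas as pd

# assumption: featurenames.txt holds one header name per line
dataFrame1 = pd.read_csv('C:/Users/Desktop/data/data/featurenames.txt', header=None, encoding='utf-8')

dataFrame2 = pd.read_csv('C:/Users/Desktop/data/data/DataSet.txt',
                         header=None,
                         names=dataFrame1[0].tolist())

print(dataFrame2['id'])   # the column can now be selected by its header name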
