Pandas read_csv does not separate values after comma - python

I am trying to load some .csv data in the Jupyter notebook but for some reason, it does not separate my data but puts everything in a single column.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv(r'C:\Users\leonm\Documents\Fontys\Semester 4\GriefBot_PracticeChallenge\DummyDataGriefbot.csv')
df.head()
My csv data
In this picture there is the data I am using.
And now I do not understand why my values all end up in a single column and are not separated where the commas are.
I have also tried both sep=',' and sep=';', but they do not change anything.
This is what I am getting
I would really appreciate your help.

If that's how your data looks in a CSV reader like Excel, then each row likely looks like one big string in a text editor.
"ID,PERSON,DATE"
"1,A. Molina,1593147221"
"2,A. Moran, 16456"
"3,Action Marquez,15436"
You could of course do "text to columns" within Excel and resave your file, or if you have many of these files, you can use the Pandas split function.
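# df.columns[0].split(',')      -> split the single header string by comma to get the new column names
# df.iloc[:, 0]                 -> select the first (and only) column of data
# .str.split(',', expand=True)  -> split each row by comma and expand each piece into its own column
df[df.columns[0].split(',')] = df.iloc[:,0].str.split(',', expand=True)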
df.head()
>    ID,PERSON,DATE          ID  PERSON          DATE
> 0  1,A. Molina,1593147221  1   A. Molina       1593147221
> 1  2,A. Moran, 16456       2   A. Moran         16456
> 2  3,Action Marquez,15436  3   Action Marquez  15436
You can then use df.drop to drop that first column if you have no use for it:
df.drop(df.columns[0], axis=1, inplace=True)
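The steps above can be tried end to end without the original file. The following is a minimal sketch that assumes the file really does wrap every line in double quotes, as in the sample rows shown earlier; the in-memory `raw` string stands in for the CSV, and the extra `strip` step is an addition to handle the stray space in "A. Moran, 16456".

```python
import io
import pandas as pd

# Stand-in for the file: every line is wrapped in quotes, so the parser
# treats each line as ONE field (assumption based on the sample rows above).
raw = (
    '"ID,PERSON,DATE"\n'
    '"1,A. Molina,1593147221"\n'
    '"2,A. Moran, 16456"\n'
    '"3,Action Marquez,15436"\n'
)

df = pd.read_csv(io.StringIO(raw))
print(df.shape)  # a single column whose name is the whole header string

# Split the header string into the real column names.
new_cols = df.columns[0].split(',')

# Split each row on commas and expand the pieces into separate columns.
parts = df.iloc[:, 0].str.split(',', expand=True)
parts.columns = new_cols

# Remove stray whitespace left over from entries such as ", 16456".
parts = parts.apply(lambda col: col.str.strip())
print(parts)
```

Building a separate `parts` frame avoids having to drop the original combined column afterwards. The values are still strings at this point, so a call such as `parts['ID'].astype(int)` would be needed for numeric work.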

Related

Import a text file with Pandas as a Dataframe where columns can contain multiple words, single words, or numbers

I was given a .txt file with 10000 rows that contain the title, imdb rating, number of votes, genres, and other information about movies. We are supposed to import this to a dataframe with pandas, but I can't figure out how to tell pandas where to separate the columns correctly. For example the first line is the movie "The Shawshank Redemption", but the second row is "Pulp Fiction". There are no commas separating the information in the .txt, only spaces. So Pandas is reading "The" "Shawshank" "Redemption" as separate fields. How am I supposed to tell Pandas how to correctly break up the .txt file? My code right now is:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import pandas as pd
import numpy as np
labels = ['imdbID','title','year','score','votes','runtime','genres']
df = pd.read_csv('imdb_top_10000.txt', sep = ' ')
I am getting this error code:
ParserError: Error tokenizing data. C error: Expected 6 fields in line 10, saw 12
You are using the wrong separator. The error indicates that splitting on the separator you defined produces more fields on some lines than expected, which is an inconsistent format for a table.
import pandas as pd
labels = ['imdbID','title','year','score','votes','runtime','genres']
df = pd.read_csv('test.txt', sep = '\t', names = labels)
I took a quick look at a similar data file; it uses tab delimiters, so sep='\t' should solve the problem for you. You can also pass your column names directly via names while constructing your dataframe.
It is always worthwhile to understand your data input structure beforehand.
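As a small illustration of why the tab separator keeps multi-word titles intact, here is a sketch using two in-memory rows. The titles come from the question, but every other value (IDs, scores, votes, runtimes, genres) is invented for the example and is not taken from the real file.

```python
import io
import pandas as pd

labels = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres']

# Two tab-separated rows; all values except the titles are made up.
sample = (
    "tt0000001\tThe Shawshank Redemption\t1994\t9.0\t1000\t140\tDrama\n"
    "tt0000002\tPulp Fiction\t1994\t8.5\t900\t150\tCrime\n"
)

# Splitting on tabs keeps the spaces inside each title within one field.
df = pd.read_csv(io.StringIO(sample), sep='\t', names=labels)
print(df[['title', 'year']])
```

If you are unsure which delimiter a file uses, printing the `repr()` of its first line will show tabs explicitly as `\t`.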

How to adjust table header when saving dataframe to excel using Pandas?

The objective is to save a df as xlsx format using the code below.
import pandas as pd
from pandas import DataFrame
list_me = [['A','A','A','A','A','B','C','D','D','D','D'],
['TT','TT','UU','UU','UU','UU','UU','TT','TT','TT','TT'],
['5','2','1','1','1','40','10','2','2','2','2'],
['1','1','1','2','3','3','1','2','2','2','1']]
df = DataFrame (list_me).transpose()
df.columns = ['Name','Activity','Hour','Month']
df_tab=pd.crosstab(df.Name, columns=[df.Month, df.Activity], values=df.Hour, aggfunc='sum').fillna(0)
df_tab.reset_index ( level=0, inplace=True )
df_tab.to_excel("output.xlsx")
The code works fine and outputs an xlsx file as below:
However, I notice that adding the index as the first column splits the labels Month, Activity, and Name across separate columns.
May I know whether there is a built-in setting within Pandas that can produce the output as below?
Thanks in advance.
P.S.: Please ignore the yellow line; it is just there to indicate that there should be a blank row.

Python vaex how to create dataframe from a CSV file?

Why do I only get the last column?
if __name__ == '__main__':
    # Run on Linux remotely from Windows
    import vaex, pandas as pd
    df_pd = pd.read_csv('./a.csv')  # contains 4 columns
    print(df_pd)
    print(list(df_pd.columns))
    df = vaex.from_pandas(df_pd)  # only last column # why???
    print(df)
Vaex replaces non-ascii characters by an underscore, but two underscores means 'hidden' column. We should change that, and I've opened an issue for that:
https://github.com/vaexio/vaex/issues/558
To create a vaex dataframe directly from a CSV file, try vaex.from_csv('a.csv').
If the dataset is huge (on the order of billions of rows), you might have to use chunk_size in from_csv to avoid memory issues.
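Until that behaviour changes, one possible workaround is to give the columns ASCII-only names in pandas before handing the frame to vaex. The sketch below is pandas-only; `ascii_safe_columns` is a hypothetical helper written for this example, the sample column names are invented, and the assumption that vaex then keeps every column is untested here, which is why the vaex call is left as a comment.

```python
import pandas as pd

def ascii_safe_columns(df):
    """Return a copy of df in which non-ASCII column names become col_<position>.

    Hypothetical helper for illustration; not part of pandas or vaex.
    """
    new_names = [
        name if str(name).isascii() else f"col_{i}"
        for i, name in enumerate(df.columns)
    ]
    return df.set_axis(new_names, axis=1)

# Invented sample frame with two non-ASCII column names.
df_pd = pd.DataFrame({"id": [1, 2], "名称": ["a", "b"], "价格": [1.5, 2.5]})

safe = ascii_safe_columns(df_pd)
print(list(safe.columns))  # ['id', 'col_1', 'col_2']

# Assumption: with ASCII-only names, no column is turned into a hidden one.
# df = vaex.from_pandas(safe)
```

Positional names are used so that two different non-ASCII names can never collapse into the same replacement.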

Remove spaces in column data without losing their original data in python pandas dataframe

import pandas as pd
df=pd.read_excel('test.xlsx')
ab= df.MobileNum.str.replace(' ','')
print(ab)
As you can see in the screenshot, the first four rows show NaN. These are my original 10-digit numbers, which had no spaces; the other rows had spaces.
So I want the first four rows to keep their original numbers in the result.
You can always do this:
new = df.MobileNum.str.replace(r"\s+", "", regex=True)
print(new)
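A likely reason for the NaN values is that the rows without spaces were read from Excel as numbers rather than strings, and the `.str` accessor returns NaN for non-string entries. The sketch below assumes that is the cause and converts the column to strings first; the phone numbers are invented for the example.

```python
import pandas as pd

# Mixed column: two numeric entries and two strings with spaces (made-up values).
df = pd.DataFrame(
    {"MobileNum": [9876543210, 9123456780, "98765 43210", "91234 567 80"]}
)

# The .str accessor yields NaN for the entries that are not strings.
naive = df["MobileNum"].str.replace(" ", "", regex=False)
print(naive)

# Converting to str first keeps the numeric entries intact.
cleaned = df["MobileNum"].astype(str).str.replace(r"\s+", "", regex=True)
print(cleaned)
```

If the column was read as floats (for example because it also contains empty cells), `astype(str)` would produce values such as "9876543210.0", so passing `dtype={'MobileNum': str}` to `read_excel` may be the safer route in that case.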

How to delete some rows with comments from a CSV file to load data to DataFrame?

There is a relatively big CSV file with data (around 80Mb). When I open it in MS Excel, I see that it contains 100 columns with many rows of data. However, the first row is not the column names, it's a web link. Furthermore, the last two rows are some comments.
So, now I want to load this data into pandas DataFrame:
import pandas as pd
df = pd.read_csv('myfile.csv')
Then I want to read a column called Duration (I see that it exists in the CSV file) and delete the word years from its values:
Duration = map(lambda x: float(x.rstrip('years')), df['Duration'])
It gives me this error:
AttributeError: 'float' object has no attribute 'rstrip'
If I open the file in MS Excel and delete the first row (a web link) and the last two rows (the comments), then the code works!
So, how can I clean this CSV file automatically in Python (to extract only columns with values)?
Update:
When I write print df.head(), it outputs:
have mixed types. Specify dtype option on import or set low_memory=False.
Do I need to specify the type for all 100 columns? What if I don't know the types a priori?
Update:
I cannot attach the file, but as an example you can check this one.
Download the file 2015-2016.
There are some parameters in pd.read_csv() that you should use:
df = pd.read_csv('myfile.csv', skiprows=1, skipfooter=2, engine='python')
I looked at the link you provided in the comments and tried to import it. I saw two mixed data types (for id and desc), so I explicitly set the dtype for these two columns. Also, by observation, the footer contains 'Total', so I excluded any row starting with the letter T. Other than the headers, valid rows should start with integers for the id column. If there are other footers not starting with T that are introduced, this will throw an error when read.
If you first download and uncompress the zip file, you can proceed as follows:
file_loc = ... # Specify location where you saved the unzipped file.
df = pd.read_csv(file_loc, skiprows=1, skip_blank_lines=True,
dtype={'id': int, 'desc': str}, comment='T')
And this will strip out year or years from the emp_length column, although you are still left with text categories.
df['emp_length'] = df.emp_length.str.replace(r'( years|year)', '', regex=True)
To skip the first line, you could use the skiprows option in read_csv. If the last two lines are not too tricky (i.e. they do not cause parsing errors), you could use .iloc to ignore them. Finally, a vectorized version of rstrip is available via the str attribute of the Duration column, assuming it contains strings.
See the following code for an example:
import pandas as pd
from io import StringIO
content = StringIO("""http://www.example.com
col1,col2,Duration
1,11,5 years
2,22,4 years
3,33,2 years
# Some comments in the
# last two lines here.
""")
df = pd.read_csv(content, skiprows=1).iloc[:-2]
df['Duration'] = df.Duration.str.rstrip('years').astype(float)
print(df)
Output:
col1 col2 Duration
0 1 11 5
1 2 22 4
2 3 33 2
If reading speed is not a concern, you can also use the skipfooter=2 option in read_csv to skip the last two lines. This will cause read_csv to use the Python parser engine instead of the faster C engine.
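For reference, the skipfooter variant could look like the sketch below. It reuses the same in-memory sample, passes engine='python' explicitly so that no fallback warning is raised, and replaces rstrip with a regular expression, which is a substitution on my part to remove only a trailing "year" or "years" rather than any trailing run of those letters.

```python
import io
import pandas as pd

content = io.StringIO("""http://www.example.com
col1,col2,Duration
1,11,5 years
2,22,4 years
3,33,2 years
# Some comments in the
# last two lines here.
""")

# Skip the link on the first line and the two comment lines at the end.
# skipfooter is only supported by the Python parser engine.
df = pd.read_csv(content, skiprows=1, skipfooter=2, engine="python")

# Strip a trailing "year"/"years" (plus preceding whitespace), then convert.
df["Duration"] = (
    df["Duration"].str.replace(r"\s*years?$", "", regex=True).astype(float)
)
print(df)
```

Because the footer rows are never parsed, col1 and col2 are inferred as integers here, whereas the .iloc[:-2] approach first reads the comment lines as data and only then discards them.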
