Python vaex: how to create a dataframe from a CSV file?

Why do I only get the last column
if __name__ == '__main__':
    # Running on a remote Linux machine from Windows
    import vaex
    import pandas as pd

    df_pd = pd.read_csv('./a.csv')  # contains 4 columns
    print(df_pd)
    print(list(df_pd.columns))
    df = vaex.from_pandas(df_pd)  # only the last column shows up, why???
    print(df)

Vaex replaces non-ASCII characters with an underscore, but two leading underscores mark a column as 'hidden'. We should change that, and I've opened an issue for it:
https://github.com/vaexio/vaex/issues/558
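Until that is fixed, one workaround is to give the columns plain ASCII names before handing the frame to vaex. A minimal sketch (replacing every non-ASCII character with 'x' is just one possible renaming rule):

import re
import pandas as pd
import vaex

df_pd = pd.read_csv('./a.csv')
# Replace every non-ASCII character in the headers with 'x' so vaex
# never produces leading double underscores (its 'hidden column' marker).
df_pd.columns = [re.sub(r'[^\x00-\x7f]', 'x', col) for col in df_pd.columns]
df = vaex.from_pandas(df_pd)
print(df)  # all four columns should now be visible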

To create a vaex dataframe from a CSV file, try vaex.from_csv('a.csv').
If the dataset is huge, on the order of billions of rows, you may need to use the chunk_size parameter of from_csv to avoid memory issues.
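For example, something along these lines (a sketch; 'big.csv' is a stand-in name, the chunk size is illustrative, and convert=True writes an HDF5 copy of the data so later reads are memory-mapped rather than loaded into RAM):

import vaex

# Small file: read directly into memory.
df = vaex.from_csv('a.csv')

# Very large file: convert to HDF5 in chunks to keep memory usage bounded.
df_big = vaex.from_csv('big.csv', convert=True, chunk_size=5_000_000)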

Related

Pandas read_csv does not separate values after comma

I am trying to load some .csv data in a Jupyter notebook, but for some reason it does not separate my data and instead puts everything in a single column.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv(r'C:\Users\leonm\Documents\Fontys\Semester 4\GriefBot_PracticeChallenge\DummyDataGriefbot.csv')
df.head()
My csv data (screenshot not reproduced here; the columns are ID, PERSON, DATE).
I do not understand why my values all end up in a single column instead of being separated at the commas.
I have also tried both sep=',' and sep=';' but they do not change anything.
This is what I am getting (screenshot not reproduced here).
I would really appreciate your help.
If that's how your data looks in a CSV reader like Excel, then each row likely looks like one big string in a text editor.
"ID,PERSON,DATE"
"1,A. Molina,1593147221"
"2,A. Moran, 16456"
"3,Action Marquez,15436"
You could of course do "text to columns" within Excel and resave your file, or, if you have many of these files, you can use pandas' str.split:
# Split the header of the first column on commas to get the real column
# names, then split each row's single string into those new columns.
df[df.columns[0].split(',')] = df.iloc[:, 0].str.split(',', expand=True)
df.head()
>   ID,PERSON,DATE           ID  PERSON          DATE
> 0  1,A. Molina,1593147221  1   A. Molina       1593147221
> 1  2,A. Moran, 16456       2   A. Moran        16456
> 2  3,Action Marquez,15436  3   Action Marquez  15436
You can then use df.drop to drop that first column if you have no use for it:
df.drop(df.columns[0], axis=1, inplace=True)

Why is data getting deleted when merging two CSV files using Pandas?

Hoping to get some insight on this issue: I am using Pandas to clean data and then merge two records together. The code is below; it successfully merges the two file headers but then deletes all my rows.
Before I can merge them, I have to rename a column of one file to match the other, strip a string out of the cell contents, and finally convert the column from an object to an int.
This is my first program, so it is a lot to bite off. I know I could do this in Excel in 2 minutes, but I want to automate it in the long term.
Thanks in advance.
import os
import pandas as pd

os.chdir("c:/users/user/desktop/exercises")
fileA = pd.read_csv("./fileA.csv")
fileB = pd.read_csv("./fileB.csv")

# Strip the text prefix, rename the column, and convert it to numeric.
fileA['step'] = fileA['step'].replace(to_replace="Case ID Number - 00",
                                      value="", regex=True)
fileA = fileA.rename(columns={'step': 'Case Number'})
fileA['Case Number'] = pd.to_numeric(fileA['Case Number'], errors='raise')
print(fileA.info())

# Merge works but then deletes all the table data
MERGE = fileA.merge(fileB, on='Case Number')
MERGE.to_csv('UPDATEDMERGE.csv')
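A merge that keeps the headers but returns zero rows usually means the join keys never match, most often because 'Case Number' has a different dtype (int vs. string) or stray whitespace in the two files. A quick diagnostic sketch, reusing the frames from the question:

# Check that both key columns have the same dtype.
print(fileA['Case Number'].dtype, fileB['Case Number'].dtype)

# Coerce both sides to numeric so equal case numbers compare equal.
fileB['Case Number'] = pd.to_numeric(fileB['Case Number'], errors='coerce')

# Count overlapping keys; zero overlap explains an empty merge.
print(fileA['Case Number'].isin(fileB['Case Number']).sum())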

Pandas is adding an extra column of data when converting from dta to csv [duplicate]

I am trying to save a csv to a folder after making some edits to the file.
Every time I use pd.to_csv('C:/Path of file.csv') the csv file has a separate column of indexes. I want to avoid printing the index to csv.
I tried:
df = pd.read_csv('C:/Path to file to edit.csv', index_col=False)
And to save the file...
df.to_csv('C:/Path to save edited file.csv', index_col=False)
However, I still got the unwanted index column. How can I avoid this when I save my files?
Use index=False.
df.to_csv('your.csv', index=False)
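A minimal round trip showing where the extra column comes from:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

df.to_csv('with_index.csv')  # the index is written as an unnamed first column
print(pd.read_csv('with_index.csv').columns.tolist())     # ['Unnamed: 0', 'a', 'b']

df.to_csv('without_index.csv', index=False)  # the index is suppressed
print(pd.read_csv('without_index.csv').columns.tolist())  # ['a', 'b']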
There are two ways to handle the situation where we do not want the index to be stored in the csv file.
As others have stated, you can use index=False while saving your dataframe to a csv file:
df.to_csv('file_name.csv', index=False)
Or you can save your dataframe as-is, with an index, and drop the 'Unnamed: 0' column, which holds your previous index, when reading it back. Simple!
df.to_csv('file_name.csv')
df_new = pd.read_csv('file_name.csv').drop(['Unnamed: 0'], axis=1)
If you want no index, read the file using:
import pandas as pd
df = pd.read_csv('file.csv', index_col=0)
and save it using:
df.to_csv('file.csv', index=False)
As others have stated, if you don't want to save the index column in the first place, you can use df.to_csv('processed.csv', index=False)
However, since the data you usually work with has some sort of index of its own, say a 'timestamp' column, I would keep the index and load the data using it.
So, to save the indexed data, first set their index and then save the DataFrame:
df = df.set_index('timestamp')
df.to_csv('processed.csv')
Afterwards, you can either read the data with the index:
pd.read_csv('processed.csv', index_col='timestamp')
or read the data, and then set the index:
df = pd.read_csv('filename.csv')
df = df.set_index('column_name')
Another solution, if you want to keep this column as the index:
pd.read_csv('filename.csv', index_col='Unnamed: 0')
If you want a clean output format, the following statement works well:
dataframe_prediction.to_csv('filename.csv', sep=',', encoding='utf-8', index=False)
In this case you get a csv file with ',' as the separator between columns and UTF-8 encoding.
In addition, the numerical index won't appear.

How to get column names from a large file by using python dataframes

Hi, I have a huge TSV file, about 1 GB. I just want to create an array that contains the column names. This is what I've done so far:
import pandas as pd
x = pd.read_csv('mytsvfile.tsv', nrows=1).columns
Unfortunately, this gives me
>>> type(x)
<class 'pandas.core.indexes.base.Index'>
and when I convert it to a list, its length is 1, which does not match the number of columns in the TSV file.
I think you need to add the separator (\t for tabs); nrows=0 also works:
x = pd.read_csv('mytsvfile.tsv', nrows=0, sep='\t').columns.tolist()
If you only need the column names, they are in the first line, and you want them as a Python list, why bring pandas into the mix at all? Just use readline() like so:
with open('mytsvfile.tsv', 'r') as tsv:
    columns = tsv.readline().rstrip('\n').split('\t')
What you are looking for can be obtained without any intermediate step:
list_of_column_names = list(x)
More generally:
list(df.columns)
You can also count the column names of your dataframe df:
df.columns.nunique()
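One caveat: nunique() counts distinct names, so it undercounts when column names repeat; len(df.columns) (or df.shape[1]) gives the actual column count. A tiny illustration:

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'a'])
print(df.columns.nunique())  # 2 distinct names
print(len(df.columns))       # 3 actual columns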

How to delete some rows with comments from a CSV file to load data to DataFrame?

There is a relatively big CSV file with data (around 80 MB). When I open it in MS Excel, I see that it contains 100 columns with many rows of data. However, the first row is not the column names; it's a web link. Furthermore, the last two rows are some comments.
So, now I want to load this data into pandas DataFrame:
import pandas as pd
df = pd.read_csv('myfile.csv')
Then I want to read a column called Duration (I can see it exists in the CSV file) and delete the word years from its values:
Duration = map(lambda x: float(x.rstrip('years')), df['Duration'])
It gives me this error:
AttributeError: 'float' object has no attribute 'rstrip'
If I open the file in MS Excel and delete the first row (a web link) and the last two rows (the comments), then the code works!
So, how can I clean this CSV file automatically in Python (to extract only columns with values)?
Update:
When I write print df.head(), it prints a warning:
... have mixed types. Specify dtype option on import or set low_memory=False.
Do I need to specify the dtype for all 100 columns? What if I don't know the types a priori?
Update:
I cannot attach the file, but as an example you can check this one.
Download the file 2015-2016.
There are some parameters of pd.read_csv() that you should use:
df = pd.read_csv('myfile.csv', skiprows=1, skipfooter=2, engine='python')
I looked at the link you provided in the comments and tried to import it. I saw two columns with mixed data types (id and desc), so I explicitly set the dtype for those two. Also, by observation, the footer contains 'Total', so I excluded any row starting with the letter T; other than the header, valid rows should start with an integer in the id column. If other footer rows not starting with T are introduced, this will throw an error on read.
If you first download and uncompress the zip file, you can proceed as follows:
file_loc = ...  # Specify the location where you saved the unzipped file.
df = pd.read_csv(file_loc, skiprows=1, skip_blank_lines=True,
                 dtype={'id': int, 'desc': str}, comment='T')
And this will strip year or years out of the emp_length column, although you are still left with text categories:
df['emp_length'] = df.emp_length.str.replace(r'( years| year)', '', regex=True)
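If you don't know the types a priori (the asker's update above), you don't have to list all 100 dtypes by hand. One possible sketch, with the caveat that dtypes inferred from a sample can still mismatch later rows:

# Option 1: read the whole file in one pass before choosing dtypes
# (uses more memory, but silences the mixed-types warning).
df = pd.read_csv(file_loc, skiprows=1, low_memory=False)

# Option 2: infer dtypes from a sample, then pass them explicitly.
sample = pd.read_csv(file_loc, skiprows=1, comment='T', nrows=10_000)
dtypes = {col: dt.name for col, dt in sample.dtypes.items()}
df = pd.read_csv(file_loc, skiprows=1, comment='T', dtype=dtypes)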
To skip the first line, you can use the skiprows option of read_csv. If the last two lines are not too tricky (i.e. they don't cause parsing errors), you can use .iloc to ignore them. Finally, a vectorized version of rstrip is available via the str attribute of the Duration column, assuming it contains strings.
See the following code for an example:
import pandas as pd
from io import StringIO

content = StringIO("""http://www.example.com
col1,col2,Duration
1,11,5 years
2,22,4 years
3,33,2 years
# Some comments in the
# last two lines here.
""")
df = pd.read_csv(content, skiprows=1).iloc[:-2]
df['Duration'] = df.Duration.str.rstrip('years').astype(float)
print(df)
Output:
  col1 col2  Duration
0    1   11       5.0
1    2   22       4.0
2    3   33       2.0
If reading speed is not a concern, you can also use the skipfooter=2 option of read_csv to skip the last two lines. This causes read_csv to use the Python parser engine instead of the faster C engine.
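Putting the pieces from this thread together for the asker's file (a sketch; 'myfile.csv' is the question's filename, and skipfooter forces the slower Python engine):

# Skip the web-link header row and the two comment rows at the end,
# then strip 'years' from Duration and convert it to a number.
df = pd.read_csv('myfile.csv', skiprows=1, skipfooter=2, engine='python')
df['Duration'] = df['Duration'].str.rstrip('years').astype(float)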
