I am unable to skip the second row of a data file while reading a csv file in python.
I am using the following code:
imdb_data = pd.read_csv('IMDB_data.csv', encoding = "ISO-8859-1",skiprows = 2)
Your code will omit the first two lines of your csv.
If you want the second line to be omitted (but the first one included), just make this minor change:
imdb_data = pd.read_csv('IMDB_data.csv', encoding = "ISO-8859-1",skiprows = [1])
Looking at the documentation, we learn that if you supply an integer n for skiprows, the first n rows are skipped. If you want to skip specific lines by line number (0-indexed), you must supply a list-like argument.
In your specific case, that would be skiprows=[1].
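To see the difference side by side, here is a small self-contained sketch (it uses an in-memory CSV via io.StringIO instead of your IMDB_data.csv, so the file name and column names are just placeholders):
import io
import pandas as pd
csv_text = "colA,colB\nrow1,1\nrow2,2\nrow3,3\n"
# skiprows=2 drops the first two lines of the file, header included,
# so "row2,2" ends up being treated as the header row
print(pd.read_csv(io.StringIO(csv_text), skiprows=2))
# skiprows=[1] keeps the header (line 0) and skips only the second line
print(pd.read_csv(io.StringIO(csv_text), skiprows=[1]))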
The question has already been answered. If one wants to skip a number of rows at once, one can do the following:
df = pd.read_csv("transaction_activity.csv", skiprows=list(np.arange(1, 13)))
It will skip rows 1 through 12 (0-indexed), i.e. the second through the thirteenth lines of the file, while keeping your original column names, since the header row is counted as row 0 and is not in the list.
Hope it helps with a similar problem.
I have the following code to generate a .csv File:
sfdc_dataframe.to_csv('sfdc_data_demo.csv',index=False,header=True)
It is just one column; how could I get the last value of the column and delete the last comma in that value?
Example image of input data:
https://i.stack.imgur.com/M5nVO.png
And the result that I'm trying to make:
https://i.stack.imgur.com/fEOXM.png
Anyone have an idea or tip?
Thanks!
Once you have read the csv file into a dataframe (the logic you shared), you can use the logic below, which replaces only the last row of your specific column.
sfdc_dataframe['your_column_name'].iat[-1] = sfdc_dataframe['your_column_name'].iat[-1][:-1]
Updated answer below, since only the value of the last row needs to be changed:
val = sfdc_dataframe.iloc[-1, sfdc_dataframe.columns.get_loc('col')]
sfdc_dataframe.iloc[-1, sfdc_dataframe.columns.get_loc('col')] = val[:-1]
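For completeness, a minimal self-contained sketch of the same idea; the column name 'Id' and the sample values are made up purely for illustration:
import pandas as pd
sfdc_dataframe = pd.DataFrame({'Id': ['a01,', 'a02,', 'a03,']})  # hypothetical sample data
col = sfdc_dataframe.columns.get_loc('Id')  # 'Id' is a placeholder for your actual column
last_value = sfdc_dataframe.iloc[-1, col]
if isinstance(last_value, str) and last_value.endswith(','):
    sfdc_dataframe.iloc[-1, col] = last_value[:-1]  # only the last row loses its trailing comma
sfdc_dataframe.to_csv('sfdc_data_demo.csv', index=False, header=True)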
Easy way
df = pd.read_csv("df_name.csv", dtype=object)
df['column_name'] = df['column_name'].str[:-1]
So I have many csv files which I have to read into a dataframe. Only problem is that they all have a description and metadata in the first 4 lines like this:
#Version: 1.0
#Date: 2006-11-02 00:00:08
After these, I have normal csv data. How do I deal with this? I could remove the lines manually; the only problem is that I have too many such files.
Use the skiprows parameter of pd.read_csv().
According to the documentation:
skiprows: Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. So call it like this:
df = pd.read_csv("path_tocsv.csv", skiprows=lambda x: x in [0, 1, 2, 3])
The advantage of this is that we can determine exactly which rows to skip and which to keep. Otherwise, simply passing skiprows=4 skips the first 4 rows.
Simply skip the first 4 rows:
df = pd.read_csv("/path/to/file", skiprows=4)
I'm fairly new to Python and still learning the ropes, so I need help with a step-by-step program without using any functions. I understand how to count through an unknown column range and output the quantity. However, for this program, I'm trying to loop through a column, picking out the unique numbers and counting their frequency.
So I have an excel file with random numbers down column A. I only put in 20 numbers but let's pretend the range is unknown. How would I go about extracting the unique numbers and inputting them into a separate column along with how many times they appeared in the list?
I'm not really sure how to go about this. :/
unique = 1
while xw.Range((unique, 1)).value != None:
    frequency = 0
    if unique != unique: break
    quantity += 1
"end"
I presume that, since you can't use functions, this may be homework... so, at a high level:
You could first go through the column and put all the values in a list.
Secondly, take the first value from the list and go through the rest of the list: is it in there? If so, then it is not unique. Now remove the duplicate you found from the list; keep going, and if you find another, remove that too.
Take the second value, and so on.
You would just need a list comprehension, some loops and perhaps .pop().
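For instance, here is a rough loop-based sketch of that idea (counting rather than deleting, since the frequencies are wanted anyway); it assumes the numbers from column A have already been read into a plain Python list called values:
values = [3, 7, 3, 1, 7, 7, 2]  # hypothetical sample; in practice read from column A
uniques = []   # each distinct number, in order of first appearance
counts = []    # counts[i] is how many times uniques[i] appears
for v in values:
    found = False
    for i in range(len(uniques)):
        if uniques[i] == v:
            counts[i] += 1
            found = True
            break
    if not found:
        uniques.append(v)
        counts.append(1)
# uniques -> [3, 7, 1, 2], counts -> [2, 3, 1, 1]
# these two lists can then be written back to two separate columns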
Using the pandas library would be the easiest way to do this. I created a sample excel sheet with only one column, called "Random_num".
import pandas
data = pandas.read_excel("sample.xlsx", sheet_name = "Sheet1")
print(data.head()) # This would give you a sneak peek of your data
print(data['Random_num'].value_counts()) # This would solve the problem you asked for
# Make sure to pass your column name within the quotation marks
#eg: data['your_column'].value_counts()
Thanks
I am looking to create a new dataframe that filters out redundant information from a previous dataframe. The original dataframe is created from looking through many file folders and providing a column of elements each containing a string of the full path to access each file. Each file is named according to trial number and score in a corresponding test folder. I need to remove all reiterations of scores that are 100 for each trial, however, the first score of 100 for each trial must remain.
With python Pandas, I am aware of using
df[df[col_header].str.contains('text')]
to specifically filter out what is needed and the use of '~' as a boolean NOT.
The unfiltered dataframe column with redundant scores looks like this:
\\desktop\Test_Scores\test1\trial1-98
\\desktop\Test_Scores\test1\trial2-100
\\desktop\Test_Scores\test1\trial3-100 #<- must remove
\\desktop\Test_Scores\test2\trial1-95
\\desktop\Test_Scores\test2\trial2-100
\\desktop\Test_Scores\test2\trial3-100 #<- must remove
\\desktop\Test_Scores\test2\trial3-100 #<- must remove
.
.
.
n
The expected result after using some code as a filter would be a dataframe that looks like this:
\\desktop\Test_Scores\test1\trial1-98
\\desktop\Test_Scores\test1\trial2-100
\\desktop\Test_Scores\test2\trial1-95
\\desktop\Test_Scores\test2\trial2-100
.
.
.
.
n
This one line should solve your problem.
df = df.loc[df["col"].shift().str.contains("-100") != df["col"].str.contains("-100")]
Update:
df["col"] = df["col"].str.replace('\t','\\t')
df['test_number'] = df.col.str.split('-').str[0].str.split('\\').str[-2]
df['score'] = df.col.str.split('-').str[1]
df.drop_duplicates(["test_number","score"], inplace = True)
df.drop(["test_number","score"],1,inplace = True)
Check this solution out. The reason I do the replace in the very first line is that your data contains \t, which in programming is a tab character, so it has to be turned back into a literal backslash-t before splitting on backslashes.
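To see it work end to end, here is a self-contained sketch that rebuilds the sample column from the question (raw string literals are used here, so the \t issue does not come up) and applies the same dedup idea:
import pandas as pd
df = pd.DataFrame({"col": [
    r"\\desktop\Test_Scores\test1\trial1-98",
    r"\\desktop\Test_Scores\test1\trial2-100",
    r"\\desktop\Test_Scores\test1\trial3-100",
    r"\\desktop\Test_Scores\test2\trial1-95",
    r"\\desktop\Test_Scores\test2\trial2-100",
    r"\\desktop\Test_Scores\test2\trial3-100",
    r"\\desktop\Test_Scores\test2\trial3-100",
]})
# split out the test folder and the score, then keep only the first occurrence of each pair
df["test_number"] = df["col"].str.split("-").str[0].str.split("\\").str[-2]
df["score"] = df["col"].str.split("-").str[1]
df = df.drop_duplicates(subset=["test_number", "score"]).drop(columns=["test_number", "score"])
print(df)  # leaves trial1-98 and trial2-100 for test1, trial1-95 and trial2-100 for test2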
I am new to Python and I want to read my data from a .txt file. Apart from the header, it contains only floats. I have 6 columns and a great many rows.
To read it, I'm using genfromtxt. Reading the first two columns works, but if I want to read the 5th column I get the following error:
Line #1357451 (got 4 columns instead of 4)
here's my code:
import numpy as np
data=np.genfromtxt(dateiname, skip_header=1, usecols=(0,1,2,5))
print(data[0:2, 0:3])
I think some values are missing in the 5th column, so it doesn't work.
Does anyone have an idea how to fix my problem and read the data of the 5th column?
From the genfromtxt docs:
Notes
-----
* When spaces are used as delimiters, or when no delimiter has been given
as input, there should not be any missing data between two fields.
If all columns, including the missing ones, line up properly, you could use a fixed-column-width version of the delimiter (a short sketch follows below).
An integer or sequence of integers
can also be provided as width(s) of each field.
When a line looks like:
one, 2, 3.4, , 5, ,
it can unambiguously identify 7 columns. If instead it is
one 2 3.4 5
it can only identify 4 columns (in general two blanks count as one, etc, and trailing blanks are ignored)
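As a concrete illustration of the fixed-width suggestion above, here is a hedged sketch with made-up data in which every field is exactly 8 characters wide and one field is blank; passing the widths as delimiter should let genfromtxt locate the empty field and fill it:
import io
import numpy as np
# hypothetical fixed-width data: 4 fields of 8 characters each,
# with the third field blank on the second line
text = (
    "     1.0" "     2.0" "     3.0" "     4.0" "\n"
    "     5.0" "     6.0" "        " "     8.0" "\n"
)
# delimiter can be a sequence of field widths instead of a character
data = np.genfromtxt(io.StringIO(text), delimiter=(8, 8, 8, 8), filling_values=0)
print(data)  # expected: [[1. 2. 3. 4.] [5. 6. 0. 8.]]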
I found another solution. With filling_values=0 I could fill the empty values with zero. Now it is working! :)
import numpy as np
data=np.genfromtxt(dateiname, skip_header=1, usecols=(0,1,2,5), delimiter='\t', invalid_raise=False, filling_values=0)
Furthermore, I no longer left the delimiter at its default but set it to a tab, and with invalid_raise=False any lines that still have the wrong number of columns are skipped instead of raising an error.