I have a dataframe converted from a tab-separated text file, but the first column label is an extra, unnecessary one.
a b c
0 1 2 NaN
1 2 3 NaN
The label a is an extra one. The dataframe should be:
b c
0 1 2
1 2 3
How do I remove a? Thanks in advance.
You can omit the first header row with the skiprows parameter and then pass new column names via names; the number of names must match the number of values in each data row:
df = pd.read_csv(file, skiprows=1, names=['b','c'])
print (df)
b c
0 1 2
1 2 3
A more dynamic option is to read only the header row with nrows=0 to get the columns, then pass them to names with the first value removed by indexing:
names = pd.read_csv(file, nrows=0).columns
df = pd.read_csv(file, skiprows=1, names=names[1:])
Another idea is to fall back to the default columns (a RangeIndex):
df = pd.read_csv(file, skiprows=1, header=None)
print (df)
0 1
0 1 2
1 2 3
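Another sketch, starting from the already-parsed frame in the question (columns a, b, c with the last column all NaN; _drop is just a throwaway label): shift the labels one position to the left and drop the surplus column:
df.columns = list(df.columns[1:]) + ['_drop']
df = df.drop(columns='_drop')
print (df)
   b  c
0  1  2
1  2  3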
Related
I want to shuffle the columns of a dataframe completely pseudo-randomly, in one line of code.
Before:
A B
0 1 2
1 1 2
After:
B A
0 2 1
1 2 1
My attempts so far:
df = df.reindex(columns=columns)
df.sample(frac=1, axis=1)
df.apply(np.random.shuffle, axis=1)
You can use np.random.default_rng()'s permutation with a seed to make it reproducible.
df = df[np.random.default_rng(seed=42).permutation(df.columns.values)]
Use DataFrame.sample with the axis argument set to columns (1):
df = df.sample(frac=1, axis=1)
print(df)
B A
0 2 1
1 2 1
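If the shuffle has to be reproducible, DataFrame.sample also accepts random_state (the seed 42 here is just an example value):
df = df.sample(frac=1, axis=1, random_state=42)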
Or use Series.sample on the columns converted to a Series, and change the column order by subsetting:
df = df[df.columns.to_series().sample(frac=1)]
print(df)
B A
0 2 1
1 2 1
Use numpy.random.permutation with the list of column names:
df = df[np.random.permutation(df.columns)]
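For completeness, a minimal self-contained run of the permutation approach, using the A/B frame from the question (the resulting column order is random):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1], 'B': [2, 2]})
df = df[np.random.permutation(df.columns)]
print(df)   # one possible result:
   B  A
0  2  1
1  2  1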
I have a pandas dataframe:
df1 = {'A':['a','b','c','d','e'],'no.':[0,1,2,3,4]}
df1 = pd.DataFrame(df1, columns=['A','no.'])
I would like to reverse the content of the second column in place, with the result being like this:
df2 = {'A':['a','b','c','d','e'],'no.':[4,3,2,1,0]}
df2 = pd.DataFrame(df2, columns=['A','no.'])
Convert the values to a NumPy array and then use indexing to reverse the order:
df1['no.'] = df1['no.'].to_numpy()[::-1]
print (df1)
A no.
0 a 4
1 b 3
2 c 2
3 d 1
4 e 0
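The to_numpy() step is what makes this work: assigning a reversed Series directly is a no-op, because pandas realigns the Series on its index during assignment. A small sketch of the pitfall:
# no-op: the reversed Series keeps its original index labels,
# so pandas puts every value back where it was
df1['no.'] = df1['no.'][::-1]
# works: a NumPy array has no index, so values are assigned by position
df1['no.'] = df1['no.'].to_numpy()[::-1]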
I have a csv file as follows:
name,age
something
tom,20
And when I put it into a dataframe it looks like:
df = pd.read_csv('file', header=None)
0 1
1 name age
2 something NaN
3 tom 20
How would I get the count of commas in the raw row data? For example, the answer should look like:
# in pseudocode
df['_count_separators'] = len(df.raw_value.count(','))
0 1 _count_separators
1 name age 1
2 something NaN 0
3 tom 20 1
Very simply, read your data as a single-column Series, then split on the comma and concatenate with the separator count.
# for an in-memory string: s = pd.read_csv(io.StringIO(text), sep='|', squeeze=True, header=None)
s = pd.read_csv('/path/to/file.csv', sep=r'|', squeeze=True, header=None)
pd.concat([
    s.str.split(',', expand=True),
    s.str.count(',').rename('_count_sep')
], axis=1)
0 1 _count_sep
0 name age 1
1 something None 0
2 tom 20 1
Another solution for the concatenation is to join on the index (this is a neat one-liner):
s.str.split(',', expand=True).join(s.str.count(',').rename('_count_sep'))
0 1 _count_sep
0 name age 1
1 something None 0
2 tom 20 1
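Note that squeeze=True was removed from read_csv in pandas 2.0; an equivalent way to get the raw lines as a single Series (still assuming | never occurs in the data) is:
s = pd.read_csv('/path/to/file.csv', sep='|', header=None)[0]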
You can do this:
df = pd.read_csv('file', header=None)
df2 = pd.read_csv('file', header=None, sep='|')  # re-read the csv with a separator that does not occur in the data, so each raw line stays in one cell
df2[0].str.findall(',').str.len()  # count the commas in each raw line
0 1
1 0
2 1
3 5
Name: 0, dtype: int64
df['_count_separators'] = df2[0].str.findall(',').str.len()
Data
name,age
something
tom,20
something,,,,,somethingelse
One line of code: len(df) - df[1].isna().sum()
You can use the csv module for counting the delimiters. This is a two-pass solution, but not necessarily less efficient than one-pass alternatives.
from io import StringIO
import csv, pandas as pd, numpy as np
x = """name,age
something
tom,20"""
# replace StringIO(x) with open('file.csv', 'r')
with StringIO(x) as fin:
    delim_counts = np.fromiter(map(len, csv.reader(fin)), dtype=int)
# replace StringIO(x) with 'file.csv'
df = pd.read_csv(StringIO(x), header=None)
df['_count_separators'] = delim_counts - 1
print(df)
0 1 _count_separators
0 name age 1
1 something NaN 0
2 tom 20 1
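If you want to avoid reading the file from disk twice, one possible sketch is to load the text once and feed it to both csv.reader and read_csv through StringIO ('file.csv' is a placeholder path):
with open('file.csv') as fin:
    text = fin.read()

delim_counts = np.fromiter(map(len, csv.reader(StringIO(text))), dtype=int)
df = pd.read_csv(StringIO(text), header=None)
df['_count_separators'] = delim_counts - 1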
I have a number of CSV files where the head looks something like:
09/07/2014,26268315,,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,,
15/07/2014,205019,10.8607
I need to read this into a dataframe and remove any rows with ,, in them. However, when I read the CSV data into a dataframe using:
df = pd.read_csv(raw_directory+'\\'+filename, error_bad_lines=False,header=None)
I get:
0 1 2 3
0 09/07/2014 26268315 NaN NaN
1 10/07/2014 6601181 16.3857 NaN
2 11/07/2014 916651 12.5879 NaN
3 14/07/2014 213357 NaN NaN
4 15/07/2014 205019 10.8607 NaN
How can I read the CSV data into a dataframe and get:
0
0 09/07/2014,26268315,,
1 10/07/2014,6601181,16.3857
2 11/07/2014,916651,12.5879
3 14/07/2014,213357,,
4 15/07/2014,205019,10.8607
I need to remove any rows where ,, is present and then resave the adjusted dataframe to a new CSV file. I was going to use:
stringList = [',,']
df = df[~df[0].isin([stringList])]
to remove the rows with ,, present so the resulting .csv head looks like:
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
15/07/2014,205019,10.8607
One possibility is to remove all columns that are entirely NaN and then drop the rows with any NaN:
df = df.dropna(axis=1, how='all').dropna()
print (df)
0 1 2
1 10/07/2014 6601181 16.3857
2 11/07/2014 916651 12.5879
4 15/07/2014 205019 10.8607
Another solution is to read with a separator that does not occur in the data, like |, and then filter with endswith:
df = pd.read_csv(raw_directory+'\\'+filename, error_bad_lines=False,header=None, sep='|')
df = df[~df[0].str.endswith(',')]
#alternative solution - $ is for end of string
#df = df[~df[0].str.contains(',$')]
print (df)
0
1 10/07/2014,6601181,16.3857
2 11/07/2014,916651,12.5879
4 15/07/2014,205019,10.8607
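To finish the task from the question and resave the filtered frame, a small sketch (new_file.csv is a placeholder path; writing with sep='|' again keeps pandas from quoting the commas inside the single raw-line column):
df.to_csv('new_file.csv', sep='|', header=False, index=False)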
I have a method that adds additional attributes to a given pandas series and I want to update a row in the df with the returned series.
Let's say I have a simple dataframe:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
a b
0 1 3
1 2 4
and now I want to replace a row with one that has additional attributes; all other rows will show NaN for that column. For example:
subdf = df.loc[1]
subdf["newVal"] = "foo"
# subdf is created externally and returned. Now it must be updated.
df.loc[1] = subdf #or something
df would look like:
a b newVal
0 1 3 NaN
1 2 4 foo
Without loss of generality, first reindex the columns and then assign with iloc (or loc):
df = df.reindex(subdf.index, axis=1)
df.iloc[-1] = subdf
df
a b newVal
0 1 3 NaN
1 2 4 foo
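An alternative sketch, if the existing columns already hold the right values and only the new attribute needs to land in the frame: assign it directly with loc, which enlarges df with the new column and fills the other rows with NaN (newVal is the example attribute from the question):
df.loc[1, 'newVal'] = subdf['newVal']
df
   a  b newVal
0  1  3    NaN
1  2  4    foo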