Count separators in CSV rows with Pandas - python

I have a csv file as follows:
name,age
something
tom,20
And when I put it into a dataframe it looks like:
df = pd.read_csv('file', header=None)
           0    1
1       name  age
2  something  NaN
3        tom   20
How would I get the count of commas in the raw row data? For example, the answer should look like:
# in pseudocode
df['_count_separators'] = len(df.raw_value.count(','))
           0    1  _count_separators
1       name  age                  1
2  something  NaN                  0
3        tom   20                  1

Very simply, read your data as a single column series, then split on comma and concatenate with separator count.
# s = pd.read_csv(pd.compat.StringIO(text), sep=r'|', squeeze=True, header=None)
s = pd.read_csv('/path/to/file.csv', sep=r'|', squeeze=True, header=None)
pd.concat([
    s.str.split(',', expand=True),
    s.str.count(',').rename('_count_sep')
], axis=1)
0 1 _count_sep
0 name age 1
1 something None 0
2 tom 20 1
Another solution for concatenation is to join on the index (this is a neat one liner):
s.str.split(',', expand=True).join(s.str.count(',').rename('_count_sep'))
0 1 _count_sep
0 name age 1
1 something None 0
2 tom 20 1
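Note: the squeeze keyword of read_csv used above was removed in pandas 2.0 (and pd.compat.StringIO is gone in newer versions; use io.StringIO for in-memory text). On a current pandas, a hedged equivalent is to squeeze the single-column frame after reading:
s = pd.read_csv('/path/to/file.csv', sep='|', header=None).squeeze('columns')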

Doing this:
df = pd.read_csv('file', header=None)
df2 = pd.read_csv('file', header=None, sep='|')  # read your csv again with another sep so each row stays in one cell
df2[0].str.findall(',').str.len()  # then count the commas in each cell with str.findall
0    1
1    0
2    1
3    5
Name: 0, dtype: int64
df['_count_separators'] = df2[0].str.findall(',').str.len()
Data
name,age
something
tom,20
something,,,,,somethingelse
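A slightly simpler equivalent of the findall/len step above, assuming the same single-column read, is str.count:
df['_count_separators'] = df2[0].str.count(',')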

One line of code: len(df) - df[1].isna().sum() (note this gives a single total, the number of rows with a non-null second column, rather than a per-row count).

You can use the csv module for counting the delimiters. This is a two-pass solution, but not necessarily inefficient compared to alternative one-pass solutions.
from io import StringIO
import csv, pandas as pd, numpy as np
x = """name,age
something
tom,20"""
# replace StringIO(x) with open('file.csv', 'r')
with StringIO(x) as fin:
    # len of each parsed row = number of fields; separators = fields - 1
    delim_counts = np.fromiter(map(len, csv.reader(fin)), dtype=int)
# replace StringIO(x) with 'file.csv'
df = pd.read_csv(StringIO(x), header=None)
df['_count_separators'] = delim_counts - 1
print(df)
0 1 _count_separators
0 name age 1
1 something NaN 0
2 tom 20 1
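If you would rather skip the csv module, a hedged (still two-pass) alternative is to count commas per line with plain Python; note that, unlike csv.reader, this also counts commas inside quoted fields:
with open('file.csv') as fh:
    counts = [line.rstrip('\n').count(',') for line in fh]
df = pd.read_csv('file.csv', header=None)
df['_count_separators'] = counts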

Related

How to merge multiple csv files?

I have several csv files that share the same first-column elements.
For example:
csv-1.csv:
Value,0
Currency,0
datetime,0
Receiver,0
Beneficiary,0
Flag,0
idx,0
csv-2.csv:
Value,0
Currency,1
datetime,0
Receiver,0
Beneficiary,0
Flag,0
idx,0
And with these files (more than 2 files, by the way) I want to merge them and create something like this:
left        csv-1   csv-2
Value       0       0
Currency    0       1
datetime    0       0
How can I create this function in Python?
You can use:
import pandas as pd
import pathlib
out = (pd.concat([pd.read_csv(csvfile, header=None, index_col=[0], names=[csvfile.stem])
                  for csvfile in sorted(pathlib.Path.cwd().glob('*.csv'))], axis=1)
         .rename_axis('left').reset_index())
Output:
>>> out
left csv-1 csv-2
0 Value 0 0
1 Currency 0 1
2 datetime 0 0
3 Receiver 0 0
4 Beneficiary 0 0
5 Flag 0 0
6 idx 0 0
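If you then want to write the merged table back out, one option (the file name here is just an example) is:
out.to_csv('merged.csv', index=False)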
First, create an index in each dataframe from its first column, which you will then join on:
import pandas as pd
# the files have no header row, so supply column names explicitly
df1 = pd.read_csv('csv-1.csv', header=None, names=['col1', 'csv-1'])
df2 = pd.read_csv('csv-2.csv', header=None, names=['col1', 'csv-2'])
df1 = df1.set_index('col1')
df2 = df2.set_index('col1')
df = df1.join(df2, how='outer')
Then rename the columns if needed, or build a new index.
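For instance, to match the desired layout from the question, a small sketch (assuming the join above) turns the index back into a column named left:
df = df.rename_axis('left').reset_index()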
Here's what you can do
import pandas as pd
from glob import glob
def refineFilename(path):
    return path.split(".")[0]
df = pd.DataFrame()
for file in glob("csv-*.csv"):
    new = pd.read_csv(file, header=None, index_col=[0])
    df[refineFilename(file)] = new[1]
df.reset_index(inplace=True)
df.rename(columns={0: "left"}, inplace=True)
print(df)
"""
left csv-1 csv-2
0 Value 0 0
1 Currency 0 1
2 datetime 0 0
3 Receiver 0 0
4 Beneficiary 0 0
5 Flag 0 0
6 idx 0 0
"""
What we are doing here is using the df variable to store all the data, then iterating through the files and adding the second column of each file to df, with the file name as the column name.

Shuffle Columns in Dataframe

I want to shuffle the columns in no particular order, completely pseudo-randomly, ideally in one line of code.
Before:
A B
0 1 2
1 1 2
After:
B A
0 2 1
1 2 1
My attempts so far:
df = df.reindex(columns=columns)
df.sample(frac=1, axis=1)
df.apply(np.random.shuffle, axis=1)
You can use np.random.default_rng()'s permutation with a seed to make it reproducible.
df = df[np.random.default_rng(seed=42).permutation(df.columns.values)]
Use DataFrame.sample with the axis argument set to columns (1):
df = df.sample(frac=1, axis=1)
print(df)
B A
0 2 1
1 2 1
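If you need the shuffled order to be reproducible, DataFrame.sample also accepts a random_state seed:
df = df.sample(frac=1, axis=1, random_state=42)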
Or use Series.sample on the columns converted to a Series, and change the order of columns by subsetting:
df = df[df.columns.to_series().sample(frac=1)]
print(df)
B A
0 2 1
1 2 1
Use numpy.random.permutation with the list of column names:
df = df[np.random.permutation(df.columns)]

pandas dataframe: delete empty label name

I have a dataframe converted from a tab-separated text file, but the first label is an extra, unnecessary one.
a b c
0 1 2 NaN
1 2 3 NaN
The label a is an extra one. The dataframe should be:
b c
0 1 2
1 2 3
How to remove a? Thanks in advance.
You can omit the first header row with the skiprows parameter and then pass new column names via names - the number of names must match the number of values in the data rows:
df = pd.read_csv(file, skiprows=1, names=['b','c'])
print (df)
b c
0 1 2
1 2 3
Or, more dynamically, read only the header row with nrows=0 to get the columns, then pass them to names with the first value removed by indexing:
names = pd.read_csv(file, nrows=0).columns
df = pd.read_csv(file, skiprows=1, names=names[1:])
Another idea is to fall back to the default columns (a RangeIndex):
df = pd.read_csv(file, skiprows=1, header=None)
print (df)
0 1
0 1 2
1 2 3
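If the dataframe is already loaded as in the question, a small after-the-fact sketch (assuming the trailing all-NaN column is the one to drop and the remaining labels simply shift left):
df = df.dropna(axis=1, how='all')  # drop the empty trailing column
df.columns = ['b', 'c']            # reassign the labels, discarding the extra 'a'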

Drop rows from a dataframe that contain a specific string

I have a number of CSV files where the head looks something like:
09/07/2014,26268315,,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,,
15/07/2014,205019,10.8607
I need to read this into a dataframe and remove any rows with ,,. However, when I read the CSV data into a dataframe using:
df = pd.read_csv(raw_directory+'\\'+filename, error_bad_lines=False,header=None)
I get:
0 1 2 3
0 09/07/2014 26268315 NaN NaN
1 10/07/2014 6601181 16.3857 NaN
2 11/07/2014 916651 12.5879 NaN
3 14/07/2014 213357 NaN NaN
4 15/07/2014 205019 10.8607 NaN
How can I read the CSV data into a dataframe and get:
0
0 09/07/2014,26268315,,
1 10/07/2014,6601181,16.3857
2 11/07/2014,916651,12.5879
3 14/07/2014,213357,,
4 15/07/2014,205019,10.8607
I need to remove any rows where ,, is present and then resave the adjusted dataframe to a new CSV file. I was going to use:
stringList = [',,']
df = df[~df[0].isin([stringList])]
to remove the rows with ,, present so the resulting .csv head looks like:
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
15/07/2014,205019,10.8607
I guess it is possible here to remove all columns where every value is NaN and then drop rows with any NaN:
df = df.dropna(axis=1, how='all').dropna()
print (df)
0 1 2
1 10/07/2014 6601181 16.3857
2 11/07/2014 916651 12.5879
4 15/07/2014 205019 10.8607
Another solution is to use a separator whose value does not appear in the data, like |, and then filter out rows ending with a comma using endswith:
df = pd.read_csv(raw_directory+'\\'+filename, error_bad_lines=False,header=None, sep='|')
df = df[~df[0].str.endswith(',')]
#alternative solution - $ is for end of string
#df = df[~df[0].str.contains(',$')]
print (df)
0
1 10/07/2014,6601181,16.3857
2 11/07/2014,916651,12.5879
4 15/07/2014,205019,10.8607
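To then resave the surviving rows in their original comma-separated form, one hedged option (the output name is just an example) is to write the raw strings back out yourself, which avoids the extra quoting to_csv would add around fields containing commas:
with open(raw_directory + '\\' + 'filtered_' + filename, 'w') as fh:
    fh.write('\n'.join(df[0]) + '\n')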

split each cell in dataframe (pandas/python)

I have a large pandas dataframe with many rows and columns containing binary data like '0|1', '0|0', '1|1', '1|0', which I would like to split into 2 dataframes and/or expand, so that this (both are useful to me):
a b c d
rowa 1|0 0|1 0|1 1|0
rowb 0|1 0|0 0|0 0|1
rowc 0|1 1|0 1|0 0|1
becomes
a b c d
rowa1 1 0 0 1
rowa2 0 1 1 0
rowb1 0 0 0 0
rowb2 1 0 0 1
rowc1 0 1 1 0
rowc2 1 0 0 1
and/or
df1: a b c d
rowa 1 0 0 1
rowb 0 0 0 0
rowc 0 1 1 0
df2: a b c d
rowa 0 1 1 0
rowb 1 0 0 1
rowc 1 0 0 1
Currently I'm trying to do something like the following, but I believe this is not very efficient; any guidance would be helpful.
from collections import defaultdict
Atmp_dict = defaultdict(list)
Btmp_dict = defaultdict(list)
for index, row in df.iterrows():
    for columnname in list(df.columns.values):
        Atmp_dict[columnname].append(row[columnname].split('|')[0])
        Btmp_dict[columnname].append(row[columnname].split('|')[1])
user2734178 is close, but his or her answer has some issues. Here is a slight variation that works
import pandas as pd
df1 = pd.DataFrame()
df2 = pd.DataFrame()
# df is your original DataFrame
for col in df.columns:
    df1[col] = df[col].apply(lambda x: x.split('|')[0])
    df2[col] = df[col].apply(lambda x: x.split('|')[1])
Here is another option that is slightly more elegant. Replace the loop with:
for col in df.columns:
    df1[col] = df[col].str.extract(r"(\d)\|", expand=False)
    df2[col] = df[col].str.extract(r"\|(\d)", expand=False)
This is pretty compact, but it seems like there should be an even easier and more compact way.
df1 = df.applymap(lambda x: str(x)[0])
df2 = df.applymap(lambda x: str(x)[2])
Or loop over the columns as in the other answers. I don't think it matters. Note that because the question specified binary data, it is OK (and simpler) to just do str[0] and str[2] rather than using split or extract.
Or you could do this, which seems almost silly, but there's nothing actually wrong with it and it is fairly compact.
df1 = df.stack().str[0].unstack()
df2 = df.stack().str[2].unstack()
stack just converts it to a series so you can use str and then unstack converts it back to a dataframe.
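None of the above produces the interleaved rowa1/rowa2 layout from the question. Building on the same stack/unstack idea, here is a hedged sketch (assuming every cell really has the 'x|y' form):
# split every cell, stack the two parts into the index, then unstack the column labels
parts = df.stack().str.split('|', expand=True).stack().unstack(level=1)
parts.index = [f'{row}{part + 1}' for row, part in parts.index]  # rowa1, rowa2, ...
parts = parts.astype(int)  # optional: back to integers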
Since it looks like all of your values are strings, you can use the .str accessor to split everything up using the pipe as your delimiter, like so:
import pandas as pd
df1 = pd.DataFrame()
df2 = pd.DataFrame()
#df is defined as in your first example
for col in df.columns:
    df1[col] = df[col].str[0]
    df2[col] = df[col].str[-1]
You'll then probably want to recast your df1 and df2 as int columns using astype(int).
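For example (a minimal sketch, assuming df1 and df2 from the loop above):
df1 = df1.astype(int)
df2 = df2.astype(int)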
