How to merge multiple CSV files in Python?

I have several CSV files that all share the same first-column values.
For example:
csv-1.csv:
Value,0
Currency,0
datetime,0
Receiver,0
Beneficiary,0
Flag,0
idx,0
csv-2.csv:
Value,0
Currency,1
datetime,0
Receiver,0
Beneficiary,0
Flag,0
idx,0
With these files (more than 2 of them, by the way) I want to merge them into something like this:
left      csv-1  csv-2
Value     0      0
Currency  0      1
datetime  0      0
How can I write this function in Python?

You can use:
import pandas as pd
import pathlib

# index each file by its first column and label the value column
# with the file stem (e.g. 'csv-1'), then concatenate side by side
out = (pd.concat([pd.read_csv(csvfile, header=None, index_col=[0], names=[csvfile.stem])
                  for csvfile in sorted(pathlib.Path.cwd().glob('*.csv'))],
                 axis=1)
         .rename_axis('left')
         .reset_index())
Output:
>>> out
          left  csv-1  csv-2
0        Value      0      0
1     Currency      0      1
2     datetime      0      0
3     Receiver      0      0
4  Beneficiary      0      0
5         Flag      0      0
6          idx      0      0

First, set the first column of each dataframe as its index; that is the column you will join on:
import pandas as pd

# the files have no header row, so supply column names explicitly
df1 = pd.read_csv('csv-1.csv', header=None, names=['col1', 'csv-1'])
df2 = pd.read_csv('csv-2.csv', header=None, names=['col1', 'csv-2'])

# index both frames by the first column, which is the join key
df1 = df1.set_index('col1')
df2 = df2.set_index('col1')

df = df1.join(df2, how='outer')
Then rename the columns if needed, or reset the index.
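Continuing from the df above, a minimal sketch of that last step (the 'left' label is an assumption taken from the desired output in the question):
# name the index and move it back out as a regular 'left' column
df = df.rename_axis('left').reset_index()
print(df)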

Here's what you can do:
import pandas as pd
from glob import glob

def refineFilename(path):
    # strip the extension so 'csv-1.csv' becomes 'csv-1'
    return path.split(".")[0]

df = pd.DataFrame()
for file in glob("csv-*.csv"):
    # read each file without a header, indexing by the first column
    new = pd.read_csv(file, header=None, index_col=[0])
    df[refineFilename(file)] = new[1]

df.reset_index(inplace=True)
df.rename(columns={0: "left"}, inplace=True)
print(df)
"""
left csv-1 csv-2
0 Value 0 0
1 Currency 0 1
2 datetime 0 0
3 Receiver 0 0
4 Beneficiary 0 0
5 Flag 0 0
6 idx 0 0
"""
Here we create an empty DataFrame, df, then iterate through all the files, adding the second column of each file to df, with the file name as the column name.
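One caveat, not part of the original answer: glob returns matches in arbitrary, OS-dependent order, so if the column order matters you could sort the file names first. A minimal sketch:
from glob import glob

# sorting the matched file names makes the resulting column order deterministic
for file in sorted(glob("csv-*.csv")):
    print(file)  # e.g. csv-1.csv, csv-2.csv, ...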

Related

Find the union of one row's values with two rows' values in a different dataset in Python

I have two data frames, dfa and dfb:
import pandas as pd
import numpy as np
dfa = pd.DataFrame(np.array([[1,0,0,1,0], [1,1,1,0,0]]))
dfa
dfb = pd.DataFrame(np.array([[1,1,0,1,0], [0,0,1,0,0],[0,0,1,1,1]]))
dfb
Now I need to overwrite the first row of dfa with the element-wise union of dfa's first row and dfb's first and second rows:
[1,0,0,1,0] union [1,1,0,1,0] union [0,0,1,0,0] = [1,1,1,1,0]
The final dfa data frame should look like this:
dfa
   0  1  2  3  4
0  1  1  1  1  0
1  1  1  1  0  0
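A minimal sketch of one way to do this (an illustration, not from the original thread), treating the union of 0/1 rows as an element-wise bitwise OR:
import numpy as np
import pandas as pd

dfa = pd.DataFrame(np.array([[1, 0, 0, 1, 0], [1, 1, 1, 0, 0]]))
dfb = pd.DataFrame(np.array([[1, 1, 0, 1, 0], [0, 0, 1, 0, 0], [0, 0, 1, 1, 1]]))

# the union of binary rows is an element-wise OR
dfa.iloc[0] = dfa.iloc[0] | dfb.iloc[0] | dfb.iloc[1]
print(dfa)
#    0  1  2  3  4
# 0  1  1  1  1  0
# 1  1  1  1  0  0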

How can I groupby over multiple files in a folder in Python?

I have a folder with 30 csvs. All of them have unique columns from one another with the exception of a single "UNITID" column. I'm looking to do a groupby function on that UNITID column across all the csvs.
Ultimately I want a single dataframe with all the columns next to each other for each UNITID.
Any thoughts on how I can do that?
Thanks in advance.
Perhaps you could merge the dataframes together, one at a time? Something like this:
import pandas as pd

# get a list of your csv paths somehow
list_of_csvs = get_filenames_of_csvs()

# load the first csv file into a DF to start with
big_df = pd.read_csv(list_of_csvs[0])

# merge the other csvs into the first, one at a time
for csv in list_of_csvs[1:]:
    df = pd.read_csv(csv)
    big_df = big_df.merge(df, how="outer", on="UNITID")
All the csvs will be merged together based on UNITID, preserving the union of all columns.
An alternative one-liner to dustin's solution is to combine functools' reduce function with DataFrame.merge(), like so:
from functools import reduce # standard library, no need to pip it
from pandas import DataFrame
# make some dfs
df1
id col_one col_two
0 0 a d
1 1 b e
2 2 c f
df2
id col_three col_four
0 0 A D
1 1 B E
2 2 C F
df3
id col_five col_six
0 0 1 4
1 1 2 5
2 2 3 6
The one-liner:
reduce(lambda x, y: x.merge(y, on="id"), [df1, df2, df3])
id col_one col_two col_three col_four col_five col_six
0 0 a d A D 1 4
1 1 b e B E 2 5
2 2 c f C F 3 6
functools.reduce docs
pandas.DataFrame.merge docs
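If the frames still live on disk, the two answers can be combined; here is a minimal sketch, assuming the 30 files sit in the current directory and each contains a UNITID column:
from functools import reduce
from glob import glob
import pandas as pd

# read every csv in the folder, then outer-merge them all on UNITID
frames = [pd.read_csv(path) for path in sorted(glob("*.csv"))]
merged = reduce(lambda left, right: left.merge(right, how="outer", on="UNITID"), frames)
print(merged.head())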

Python: Create two dummy columns from one column

I have a one-column Pandas dataframe:
'asdf'
0
1
1
1
0
...
1
How do I turn it into:
1 0
0 1
0 1
0 1
1 0
such that the first column is the "0" column and the second column is the "1" column and both are dummy variables?
You can use get_dummies. For example:
import numpy as np
import pandas as pd
one = pd.DataFrame({'asdf':np.random.randint(0,2,10)})
two = pd.get_dummies(one.loc[:,'asdf'])
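For a concrete illustration, here is a sketch with a fixed input instead of the random one above, passing dtype=int so the dummies come out as 0/1 integers like the desired output:
import pandas as pd

one = pd.DataFrame({'asdf': [0, 1, 1, 1, 0]})
two = pd.get_dummies(one['asdf'], dtype=int)
print(two)
#    0  1
# 0  1  0
# 1  0  1
# 2  0  1
# 3  0  1
# 4  1  0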

Count separators in CSV rows with Pandas

I have a csv file as follows:
name,age
something
tom,20
And when I put it into a dataframe it looks like:
df = pd.read_csv('file', header=None)
0 1
1 name age
2 something NaN
3 tom 20
How would I get the count of commas in the raw row data? For example, the answer should look like this:
# in pseudocode
df['_count_separators'] = len(df.raw_value.count(','))
0 1 _count_separators
1 name age 1
2 something NaN 0
3 tom 20 1
Very simply, read your data as a single-column Series, then split on the comma and concatenate with the separator count.
import pandas as pd

# s = pd.read_csv(pd.compat.StringIO(text), sep=r'|', squeeze=True, header=None)
# sep='|' never occurs in the data, so each line is read as one field;
# squeeze=True returns it as a Series (removed in pandas 2.0: use .squeeze('columns'))
s = pd.read_csv('/path/to/file.csv', sep=r'|', squeeze=True, header=None)

pd.concat([
    s.str.split(',', expand=True),
    s.str.count(',').rename('_count_sep')
], axis=1)
0 1 _count_sep
0 name age 1
1 something None 0
2 tom 20 1
Another solution for the concatenation is to join on the index (this is a neat one-liner):
s.str.split(',', expand=True).join(s.str.count(',').rename('_count_sep'))
0 1 _count_sep
0 name age 1
1 something None 0
2 tom 20 1
You can do this:
df = pd.read_csv('file', header=None)
# read the csv again with a separator that does not occur in the data,
# so each whole row ends up in a single cell
df2 = pd.read_csv('file', header=None, sep='|')
# then count the commas in each row with str.findall
df2[0].str.findall(',').str.len()
0 1
1 0
2 1
3 5
Name: 0, dtype: int64
df['_count_separators'] = df2[0].str.findall(',').str.len()
Data
name,age
something
tom,20
something,,,,,somethingelse
One line of code (for the two-column example this gives the number of rows that contain a separator): len(df) - df[1].isna().sum()
You can use the csv module for counting delimiters. This is a two-pass solution, but not necessarily less efficient than the alternative one-pass solutions.
from io import StringIO
import csv
import numpy as np
import pandas as pd

x = """name,age
something
tom,20"""

# replace StringIO(x) with open('file.csv', 'r')
with StringIO(x) as fin:
    # csv.reader yields a list of fields per row; fields minus 1 == separators
    delim_counts = np.fromiter(map(len, csv.reader(fin)), dtype=int)

# replace StringIO(x) with 'file.csv'
df = pd.read_csv(StringIO(x), header=None)
df['_count_separators'] = delim_counts - 1
print(df)
0 1 _count_separators
0 name age 1
1 something NaN 0
2 tom 20 1

Iterate over columns using conditional in python?

When working with a DataFrame, is there a way to change the value of a cell based on a value in a column?
For example, I have a DataFrame of exam results that looks like this:
answer_is_a answer_is_c
0 a a
1 b b
2 c c
I want to code them as correct (1) and incorrect (0), so it would look like this:
answer_is_a answer_is_c
0 1 0
1 0 0
2 0 1
So I need to iterate over the entire DataFrame, compare what is already in the cell with the last character of the column header, and then change the cell value.
Any thoughts?
By default, DataFrame.apply iterates through the columns, passing each as a series to the function you feed it. Series have a name attribute that is a string we'll use to extract the answer.
So you could do this:
from io import StringIO
import pandas

data = StringIO("""\
answer_is_a answer_is_c
a a
b b
c c
""")

x = (
    pandas.read_table(data, sep=r'\s+')
          .apply(lambda col: col == col.name.split('_')[-1])
          .astype(int)
)
And x prints out as:
answer_is_a answer_is_c
0 1 0
1 0 0
2 0 1
