Creating new columns in a csv file using data from a different csv file - python

I have this Data Science problem where I need to create a test set using info provided in two csv files.
Problem
data1.csv
cat,In1,In2
aaa, 0, 1
aaa, 2, 1
aaa, 2, 0
aab, 3, 2
aab, 1, 2
data2.csv
cat,index,attribute1,attribute2
aaa, 0, 150, 450
aaa, 1, 250, 670
aaa, 2, 30, 250
aab, 0, 60, 650
aab, 1, 50, 30
aab, 2, 20, 680
aab, 3, 380, 250
From these two files, what I need is an updated data1.csv file where, in place of In1 and In2, I have the attributes of those specific indices (In1 and In2) under the specific category (cat).
Note: All the indices in a specific category (cat) have their own attributes.
Result should look like this,
updated_data1.csv
cat,In1a1,In1a2,In2a1,In2a2
aaa, 150, 450, 250, 670
aaa, 30, 250, 250, 670
aaa, 30, 250, 150, 450
aab, 380, 250, 20, 680
aab, 50, 30, 20, 680
I need an approach to tackle this problem using pandas in Python. So far I have loaded the csv files into my Jupyter notebook, and I have no clue where to start.
Please note this is my first week using Python for data manipulation and I have very little knowledge of the language. Also pardon the ugly formatting; I'm typing this question on a mobile phone.

As others have suggested, you can use pd.merge. In this case, you need to merge on multiple columns. Basically you need to define which columns from the left DataFrame (here data1) map to which columns from the right DataFrame (here data2). Also see pandas merging 101.
import pandas as pd

# Read the csvs
data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')
# DataFrame with the In1 columns
df1 = pd.merge(left=data1, right=data2, left_on=['cat', 'In1'], right_on=['cat', 'index'])
df1 = df1[['cat', 'attribute1', 'attribute2']].set_index('cat')
# DataFrame with the In2 columns
df2 = pd.merge(left=data1, right=data2, left_on=['cat', 'In2'], right_on=['cat', 'index'])
df2 = df2[['cat', 'attribute1', 'attribute2']].set_index('cat')
# Join the two dataframes together
df = pd.concat([df1, df2], axis=1)
# Name the columns as desired
df.columns = ['In1a1', 'In1a2', 'In2a1', 'In2a2']
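To get the updated file asked for in the question, writing the result out is one more call; the file name here just mirrors the one used in the question:
# 'cat' is the index of df, so it becomes the first column of the csv again
df.to_csv('updated_data1.csv')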
One should generally try to avoid iterating through DataFrames, because it is not very efficient, but it is definitely a possible solution here.
import pandas as pd

# Read the csvs
data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')
# This list will be the data for the resulting DataFrame
rows = []
# Iterate through data1, unpacking the values in each row to variables
for idx, cat, in1, in2 in data1.itertuples():
    # Create a dictionary for each row where the keys are the column headers of the future DataFrame
    row = {}
    row['cat'] = cat
    # Build boolean masks that pick the correct rows from data2
    in1 = (data2['index'] == in1) & (data2['cat'] == cat)
    in2 = (data2['index'] == in2) & (data2['cat'] == cat)
    # Assign the correct values to the keys in the dictionary
    row['in1a1'] = data2.loc[in1, 'attribute1'].values[0]
    row['in1a2'] = data2.loc[in1, 'attribute2'].values[0]
    row['in2a1'] = data2.loc[in2, 'attribute1'].values[0]
    row['in2a2'] = data2.loc[in2, 'attribute2'].values[0]
    # Append the dictionary to the list
    rows.append(row)
# Construct a DataFrame from the list of dictionaries
df = pd.DataFrame(rows)

Related

Python pandas convert csv file into wide long txt file and put the values that have the same name in the "MA" column in the same row

I want to get a file from the csv file formatted as follows:
CSV file:
Desired output txt file (Header italicized):
MA Am1 Am2 Am3 Am4
MX1 X Y - -
MX2 9 10 11 12
Any suggestions on how to do this? Thank you!
I need help writing the Python code to achieve this. I've tried to loop through every row, but I'm still struggling to find a way to write this.
You can try this.
Group the rows by their unique MA values (the name column here) and collect each group's values into a list.
Create a new dataframe from those lists.
Expand each values list into columns and add them to the new dataframe.
Copy the name column from the first dataframe.
Reorder the columns so 'name' comes first.
Code:
import pandas as pd

df = pd.DataFrame([['MX1', 1, 222], ['MX1', 2, 222], ['MX2', 4, 44], ['MX2', 3, 222], ['MX2', 5, 222]], columns=['name', 'values', 'etc'])
df_new = pd.DataFrame(columns=['name', 'values'])
for group in df.groupby('name'):
    df_new.loc[-1] = [group[0], group[1]['values'].to_list()]
    df_new.index = df_new.index + 1
    df_new = df_new.sort_index()
df_expanded = pd.DataFrame(df_new['values'].values.tolist()).add_prefix('Am')
df_expanded['name'] = df_new['name']
cols = df_expanded.columns.tolist()
cols = cols[-1:] + cols[:-1]
df_expanded = df_expanded[cols]
print(df_expanded.fillna('-'))
Output:
name Am0 Am1 Am2
0 MX2 4 3 5.0
1 MX1 1 2 -

Adding a pandas.DataFrame to another one with its own name

I have data that I want to retrieve from a couple of text files in a folder. For each file in the folder, I create a pandas.DataFrame to store the data. For now it works correctly and all the files have the same number of rows.
Now what I want to do is add each of these dataframes to a 'master' dataframe containing all of them. I would like to add each dataframe to the master dataframe under its file name.
I already have the file name.
For example, let's say I have 2 dataframes with their own file names; I want to add them to the master dataframe with a header for each of these 2 dataframes representing the name of the file.
What I have tried now is the following:
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame()
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_name = getFileName(file)
    t0_data.insert(loc=len(t0_data.columns), column=file_name, value=file_data)
Could someone help me with this please?
Thank you :)
Edit:
I think I was not clear enough; this is what I am expecting as an output:
output
You may be looking for the concat function. Here's an example:
import pandas as pd
A = pd.DataFrame({'Col1': [1, 2, 3], 'Col2': [4, 5, 6]})
B = pd.DataFrame({'Col1': [7, 8, 9], 'Col2': [10, 11, 12]})
a_filename = 'a_filename.txt'
b_filename = 'b_filename.txt'
A['filename'] = a_filename
B['filename'] = b_filename
C = pd.concat((A, B), ignore_index = True)
print(C)
Output:
Col1 Col2 filename
0 1 4 a_filename.txt
1 2 5 a_filename.txt
2 3 6 a_filename.txt
3 7 10 b_filename.txt
4 8 11 b_filename.txt
5 9 12 b_filename.txt
There are a couple of changes to make here in order to make this happen in an easy way. I'll list the changes and the reasoning below:
Specify which columns your master DataFrame will have up front.
Instead of using a separate function like the one you seem to be defining, you can simply create a new column called 'file_name' that holds the file path used to build the DataFrame, for every record in that DataFrame. That way, when you combine the DataFrames, each record's origin is clear. I commented the particular portion you can edit if you want to use string methods to clean up the file names.
At the end, don't use insert. For combining DataFrames with the same columns (a union operation, if you're familiar with SQL or with set theory), you can use the append method, as shown below.
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame(columns=['wavelength', 'max', 'min', 'file_name'])
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file  # You can make edits here
    t0_data = t0_data.append(file_data, ignore_index=True)
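If you are on a pandas version where DataFrame.append is no longer available (it was removed in pandas 2.0), a sketch of an equivalent pattern is to collect the per-file frames in a list and concatenate once at the end; parseGFfile is the helper from the question:
frames = []
for file in t0_folder:
    file_data = pd.DataFrame(parseGFfile(file), columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file  # same idea: tag every record with its source file
    frames.append(file_data)
t0_data = pd.concat(frames, ignore_index=True)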

How to compare 2 huge CSV files, based on column names specified at run time and ignoring few columns?

I need to write a program that compares 2 CSV files and reports the differences in an excel file. It compares the records based on a Primary key (and sometimes a few Secondary keys) ignoring a list of other columns specified. All these parameters are read from an excel.
I have written a code that does this and works okay for small files but the performance is very poor for huge files (some files that are to be compared have way more than 200K rows).
The current logic uses csv.DictReader to read the files. I iterate over the rows of first file reading row by row, each time finding the corresponding record in the second file (comparing Primary and Secondary keys). If the record is found, I then compare all the columns ignoring those specified in the excel. If there is a difference in any of the columns, I write both records in the excel report highlighting the difference.
Below is the code I have so far. It would be very kind if someone could provide any tips to optimize this program or suggest a different approach.
import csv
from openpyxl.styles import PatternFill, Color
# wb (the parameters workbook), wb_report (the report workbook), filename1/filename2
# and the report bookkeeping variables are set up elsewhere in the program.

primary_key = wb['Parameters'].cell(6, 2).value  # Read Primary Key
secondary_keys = []  # Read Secondary Keys into a list
col = 4
while wb['Parameters'].cell(6, col).value:
    secondary_keys.append(wb['Parameters'].cell(6, col).value)
    col += 1
len_secondary_keys = len(secondary_keys)
ignore_col = []  # Read Columns to be ignored into a list
row = 8
while wb['Parameters'].cell(row, 2).value:
    ignore_col.append(wb['Parameters'].cell(row, 2).value)
    row += 1

with open(filename1) as csv_file_1, open(filename2) as csv_file_2:
    file1_reader = csv.DictReader(csv_file_1, delimiter='~')
    for row_file1 in file1_reader:
        record_found = False
        csv_file_2.seek(0)  # rewind so the second file can be scanned again for every row of file 1
        file2_reader = csv.DictReader(csv_file_2, delimiter='~')
        for row_file2 in file2_reader:
            if row_file2[primary_key] == row_file1[primary_key]:
                for key in secondary_keys:
                    if row_file2[key] != row_file1[key]:
                        break
                compare(row_file1, row_file2)
                record_found = True
                break
        if not record_found:
            report_not_found(sheet_name1, row_file1, row_no_file1)

def compare(row_file1, row_file2):
    global row_diff
    data_difference = False
    for key in row_file1:
        if key not in ignore_col:
            if row_file1[key] != row_file2[key]:
                data_difference = True
                break
    if data_difference:
        c = 1
        for key in row_file1:
            wb_report['DW_Diff'].cell(row=row_diff, column=c).value = row_file1[key]
            wb_report['DW_Diff'].cell(row=row_diff + 1, column=c).value = row_file2[key]
            if row_file1[key] != row_file2[key]:
                wb_report['DW_Diff'].cell(row=row_diff + 1, column=c).fill = PatternFill(patternType='solid',
                                                                                         fill_type='solid',
                                                                                         fgColor=Color('FFFF0000'))
            c += 1
        row_diff += 2
You are running into speed issues because of the structure of your comparison: you are using a nested loop that compares each entry in one collection to every entry in the other, which is O(N^2) and therefore slow.
One way you could modify your code is to change how you ingest the data: instead of using csv.DictReader to make a list of dictionaries for each file, create a single dictionary per file, using the primary and secondary keys as the dictionary keys. This way you can compare entries between the two dictionaries very easily, and each lookup takes constant time.
This construct assumes that the primary/secondary keys are unique within each file, which it seems you are assuming above.
Here is a toy example. In it I'm just using an integer and an animal type as the (primary key, secondary key) tuple:
In [7]: file1_dict = {(1, 'dog'): [45, 22, 66], (3, 'bird'): [55, 20, 1], (15, 'cat'): [6, 8, 90]}
In [8]: file2_dict = {(1, 'dog'): [45, 22, 66], (3, 'bird'): [4, 20, 1]}
In [9]: file1_dict
Out[9]: {(1, 'dog'): [45, 22, 66], (3, 'bird'): [55, 20, 1], (15, 'cat'): [6, 8, 90]}
In [10]: file2_dict
Out[10]: {(1, 'dog'): [45, 22, 66], (3, 'bird'): [4, 20, 1]}
In [11]: for k in file1_dict:
    ...:     if k in file2_dict:
    ...:         if file1_dict[k] == file2_dict[k]:
    ...:             print('matched %s' % str(k))
    ...:         else:
    ...:             print('different %s' % str(k))
    ...:     else:
    ...:         print('no corresponding key for %s' % str(k))
    ...:
matched (1, 'dog')
different (3, 'bird')
no corresponding key for (15, 'cat')
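As a rough sketch (not part of the original answer) of how dictionaries like these could be built from the real files: this assumes primary_key, secondary_keys and ignore_col hold the values read from the parameter sheet, and filename1/filename2 and the '~' delimiter come from the question.
import csv

def load_keyed(filename, delimiter='~'):
    # Key each row by (primary key, *secondary keys) so later lookups are constant time
    keyed = {}
    with open(filename, newline='') as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            key = (row[primary_key], *(row[k] for k in secondary_keys))
            # Keep only the columns that take part in the comparison
            keyed[key] = {k: v for k, v in row.items() if k not in ignore_col}
    return keyed

file1_dict = load_keyed(filename1)
file2_dict = load_keyed(filename2)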
I was able to achieve this using the pandas library, as suggested by @Vaibhav Jadhav, using the steps below:
1. Import the 2 CSV files into dataframes, e.g.:
try:
    data1 = pd.read_csv(codecs.open(filename1, 'rb', 'utf-8', errors='ignore'), sep=delimiter1, dtype='str', error_bad_lines=False)
    print(data1[keys[0]])
except:
    data1 = pd.read_csv(codecs.open(filename1, 'rb', 'utf-16', errors='ignore'), sep=delimiter1, dtype='str', error_bad_lines=False)
2. Delete the columns not to be compared from both dataframes:
for col in data1.columns:
    if col in ignore_col:
        del data1[col]
        del data2[col]
3. Merge the 2 dataframes with indicator=True:
merged = pd.merge(data1, data2, how='outer', indicator=True)
4. From the merged dataframe, delete the rows that were present in both dataframes:
merged = merged[merged._merge != 'both']
5. Sort the dataframe by the key(s):
merged.sort_values(by = keys, inplace = True, kind = 'quicksort')
6. Iterate over the rows of the dataframe, comparing the keys of consecutive rows. If the keys differ, the row exists in only one of the 2 CSV files; if the keys are the same, iterate over the individual columns to find which column values differ. A rough sketch of this last step follows.
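A minimal sketch of that last step, assuming merged holds only the differing rows, keys is the list of key column names used above, and printing stands in for writing to the Excel report:
rows = merged.reset_index(drop=True)
i = 0
while i < len(rows):
    row1 = rows.iloc[i]
    # The next row belongs to the same record only if all key columns match
    same_keys = i + 1 < len(rows) and all(row1[k] == rows.iloc[i + 1][k] for k in keys)
    if same_keys:
        # Record exists in both files: find the columns whose values differ
        row2 = rows.iloc[i + 1]
        diff_cols = [c for c in rows.columns
                     if c not in list(keys) + ['_merge'] and row1[c] != row2[c]]
        print(tuple(row1[k] for k in keys), 'differs in', diff_cols)
        i += 2
    else:
        # Record exists in only one of the two files
        print(tuple(row1[k] for k in keys), 'found only in', row1['_merge'])
        i += 1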
It is a good use case for Apache Beam.
Features like "groupbykey" will make matching by keys more efficient.
Using an appropriate runner you can efficiently scale to much larger datasets.
Possibly there is no Excel IO, but you could output to a csv, a database, etc. A small sketch of the keyed-join idea follows the links below.
https://beam.apache.org/documentation/
https://beam.apache.org/documentation/transforms/python/aggregation/groupbykey/
https://beam.apache.org/documentation/runners/capability-matrix/
https://beam.apache.org/documentation/io/built-in/
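As a rough, hypothetical illustration of the keyed-join idea (not code from the question): the in-memory rows, keys and column names below are made up, and in a real pipeline they would come from beam.io.ReadFromText plus a parsing step. CoGroupByKey pairs up the rows of the two files by key in one pass.
import apache_beam as beam

# Hypothetical (key, row) pairs standing in for the two parsed CSV files
file1 = [('k1', {'colA': '1', 'colB': 'x'}), ('k2', {'colA': '2', 'colB': 'y'})]
file2 = [('k1', {'colA': '1', 'colB': 'x'}), ('k2', {'colA': '2', 'colB': 'z'})]

def report(element):
    key, grouped = element
    rows1 = list(grouped['file1'])
    rows2 = list(grouped['file2'])
    if not rows1 or not rows2:
        yield (key, 'present in only one file')
    elif rows1[0] != rows2[0]:
        yield (key, 'values differ')

with beam.Pipeline() as p:
    pcoll1 = p | 'rows of file1' >> beam.Create(file1)
    pcoll2 = p | 'rows of file2' >> beam.Create(file2)
    ({'file1': pcoll1, 'file2': pcoll2}
     | beam.CoGroupByKey()
     | beam.FlatMap(report)
     | beam.Map(print))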

Pandas- How to save frequencies of different values in different columns line by line in a csv file (including 0 frequencies)

I have a CSV file with the following columns of interest
fields = ['column_0', 'column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6', 'column_7', 'column_8', 'column_9']
For each of these columns there are 153 lines of data, containing only two values: -1 or +1.
My problem is that, for each column, I would like to save the frequencies of the -1 and +1 values in comma-separated style, line by line, in a CSV file. I run into the following problems:
>>>df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
>>>print df['column_2'].value_counts()
1 148
-1 5
>>>df['column_2'].value_counts().to_csv('result.txt', index=False )
Then, when I open result.txt, here is what I find:
148
5
Which is obviously not what I want; I want the values on the same line of the text file, separated by a comma (e.g., 148, 5).
The second problem happens when one of the frequencies is zero:
>>> print df['column_9'].value_counts()
1 153
>>> df['column_9'].value_counts().to_csv('result.txt', index=False )
Then, when I open result.txt, here is what I find:
153
I also don't want that behavior; I would like to see 153, 0.
So, in summary, I would like to know how to do the following with pandas:
Given one column, save the frequencies of its different values on the same line of a csv file, separated by commas. For example:
148,5
If there is a value with frequency 0, put that in the CSV. For example:
153,0
Append these frequency values in different lines of the same CSV file. For example:
148,5
153,0
Can I do that with pandas, or should I move to another Python library?
Example with some dummy data:
import pandas as pd
df = pd.DataFrame({'col1': [1, 1, 1, -1, -1, -1],
                   'col2': [1, 1, 1, 1, 1, 1],
                   'col3': [-1, 1, -1, 1, -1, -1]})
counts = df.apply(pd.Series.value_counts).fillna(0).T
print(counts)
Output:
-1 1
col1 3.0 3.0
col2 0.0 6.0
col3 4.0 2.0
You can then export this to csv.
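For example, a small sketch of the export, assuming you want one line per column with the +1 count first (to match the 148,5 style) and no header or index; the file name just mirrors the one in the question:
# Columns of `counts` are the values -1 and 1; reorder so the +1 count comes first
counts[[1, -1]].astype(int).to_csv('result.txt', header=False, index=False)
# result.txt then contains, for the dummy data above:
# 3,3
# 6,0
# 2,4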
See this answer for ref:
How to get value counts for multiple columns at once in Pandas DataFrame?
I believe you could do what you want like this
import io
import pandas as pd

df = pd.DataFrame({'column_1': [1, -1, 1], 'column_2': [1, 1, 1]})

with io.StringIO() as stream:
    # it's easier to transpose a dataframe so that the number of rows become columns
    # .to_frame to DataFrame and .T to transpose
    df['column_1'].value_counts().to_frame().T.to_csv(stream, index=False)
    print(stream.getvalue())  # check the csv data
But I would suggest something like the following, since otherwise you would have to specify yourself that one of the expected values is missing:
with io.StringIO() as stream:
    # value_counts for each column; fillna(0) supplies a 0 count for values that never occur
    counts = df[['column_1', 'column_2']].apply(lambda column: column.value_counts())
    counts = counts.fillna(0)
    counts.T.to_csv(stream, index=False)
    print(stream.getvalue())  # check the csv data
Here is an example with three columns c1, c2, c3 and a data frame d which is defined before the function is invoked.
import pandas as pd
import collections

def wcsv(d):
    dc = [dict(collections.Counter(d[i])) for i in d.columns]
    for i in dc:
        if -1 not in list(i.keys()):
            i[-1] = 0
        if 1 not in list(i.keys()):
            i[1] = 0
    w = pd.DataFrame([list(j.values()) for j in dc], columns=['1', '-1'], index=['c1', 'c2', 'c3'])
    w.to_csv("t.csv")

d = pd.DataFrame([[1, 1, -1], [-1, 1, 1], [1, 1, -1], [1, 1, -1]], columns=['c1', 'c2', 'c3'])
wcsv(d)

Python: Add rows with different column names to dict/dataframe

I want to add data (dictionaries) to a dictionary, where every added dictionary represents a new row. It is an iterative process and it is not known which column names a newly added dictionary (row) will have. In the end I want a pandas dataframe. Furthermore, I have to write the dataframe to a file every 1500 rows (which is a problem, because after 1500 rows it can of course happen that new data is added whose columns are not present in the 1500 rows already written to the file).
I need an approach that is very fast (maybe 26 ms per row). My approach is slow, because it has to check every piece of data for new column names, and in the end it has to re-read the file to create a new file in which all columns have the same length. The data comes from a queue which is processed in another process.
import pandas as pd

def writingData(writingQueue, exportFullName='path', buffer=1500, maxFiles=150000):
    imagesPassed = 0
    with open(exportFullName, 'a') as f:
        columnNamesAllList = []
        columnNamesAllSet = set()
        dfTempAll = pd.DataFrame(index=range(buffer), columns=columnNamesAllList)
        columnNamesUpdated = False
        for data in iter(writingQueue.get, "STOP"):
            print(imagesPassed)
            dfTemp = pd.DataFrame([data], index=[imagesPassed])
            # Check whether the incoming row introduces column names not seen before
            if set(dfTemp).difference(columnNamesAllSet):
                columnNamesAllSet.update(set(dfTemp))
                columnNamesAllList.extend(list(dfTemp))
                columnNamesUpdated = True
            else:
                columnNamesUpdated = False
            if columnNamesUpdated:
                print('Updated')
                dfTempAll = dfTemp.combine_first(dfTempAll)
            else:
                dfTempAll.iloc[imagesPassed - 1] = dfTemp.iloc[0]
            imagesPassed += 1
            # Flush the buffer to the file every `buffer` rows
            if imagesPassed == buffer:
                dfTempAll.dropna(how='all', inplace=True)
                dfTempAll.to_csv(f, sep='\t', header=True)
                dfTempAll = pd.DataFrame(index=range(buffer), columns=columnNamesAllList)
                imagesPassed = 0
Reading it in again:
dfTempAll = pd.DataFrame(index=range(maxFiles), columns=columnNamesAllList)
for number, chunk in enumerate(pd.read_csv(exportFullName, delimiter='\t', chunksize=buffer, low_memory=True, memory_map=True, engine='c')):
    dfTempAll.iloc[number*buffer:(number+1)*buffer] = pd.concat([chunk, columnNamesAllList]).values  # .to_csv(f, sep='\t', header=False)  # , chunksize=buffer
    # dfTempAll = pd.concat([chunk, dfTempAll])
dfTempAll.reset_index(drop=True, inplace=True)
dfTempAll.to_csv(exportFullName, sep='\t', header=True)
Small example with dataframes
So, to make it clear: let's say I have a 4-row existing dataframe (in the real case it could have 150000 rows, as in the code above) where 2 rows are already filled with data, and I add a new row. It could look like this, with the exception that in the raw input the new data is a dictionary:
df1 = pd.DataFrame(index=range(4),columns=['A','B','D'], data={'A': [1, 2, 'NaN', 'NaN'], 'B': [3, 4,'NaN', 'NaN'],'D': [3, 4,'NaN', 'NaN']})
df2 = pd.DataFrame(index=[2], columns=['A','C','B'], data={'A': [0], 'B': [0], 'C': [0]})
