Python sort CSV file - python

Hey, I have a CSV file with many rows, but the value in the first column constantly repeats. Is it possible to keep only the first occurrence of that name for each group of rows and keep all the other data? I tried with pandas, but pandas asks for an aggregation function such as sum. My data in the CSV file looks like this:
H1 h2 h3 h4
A 1 2 3 4
A 2 3 4 5
A 3 4 5 6
B 1 2 3 4
B 2 3 4 5
B 3 4 5 6
C 1 2 3 4
C 2 3 4 5
C 3 4 5 6
Each of these columns has a header, shown here as h1-h4.
My real data is not like this; it contains actual text values.
I want to rearrange the data so it looks like this:
A
1 2 3 4
2 3 4 5
3 4 5 6
B
1 2 3 4
2 3 4 5
3 4 5 6
C
1 2 3 4
2 3 4 5
3 4 5 6
Or
A 1 2 3 4
2 3 4 5
3 4 5 6
B 1 2 3 4
2 3 4 5
3 4 5 6
C 1 2 3 4
2 3 4 5
3 4 5 6
So basically I want to group by the first column, whose header is h1. Any help would be appreciated, thanks.

The following should work. It assumes your source data is space delimited (as you have shown); if it uses commas or tabs, you will need to change the delimiter I have used.
import csv

with open("input.csv", "r") as f_input, open("output.csv", "wb") as f_output:
    csv_input = csv.reader(f_input, delimiter=" ")
    csv_output = csv.writer(f_output)
    headers = next(csv_input)    # read (and skip past) the header row
    cur_row = ""
    for cols in csv_input:
        if cur_row != cols[0]:
            # the first column has changed, so start a new group
            cur_row = cols[0]
            csv_output.writerow([cur_row])
        csv_output.writerow(cols[1:])
Giving you an output CSV file as follows:
A
1,2,3,4
2,3,4,5
3,4,5,6
B
1,2,3,4
2,3,4,5
3,4,5,6
C
1,2,3,4
2,3,4,5
3,4,5,6
Tested using Python 2.7
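If you are on Python 3, the only change that should be needed is how the output file is opened: csv.writer there expects a text-mode file opened with newline="" rather than "wb". A minimal sketch of the changed line:
# Python 3: open the output in text mode with newline="" instead of "wb"
with open("input.csv", "r") as f_input, open("output.csv", "w", newline="") as f_output: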
To add the headers for each group, change the first writerow line as follows:
csv_output.writerows([[cur_row], headers])
Giving the following output:
A
H1,h2,h3,h4
1,2,3,4
2,3,4,5
3,4,5,6
B
H1,h2,h3,h4
1,2,3,4
2,3,4,5
3,4,5,6
C
H1,h2,h3,h4
1,2,3,4
2,3,4,5
3,4,5,6
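Since you mentioned trying pandas: you don't need an aggregation function for this, because groupby can be used purely to iterate over the groups. A rough sketch along those lines, assuming the input is space delimited and the first column is headed H1 as in your sample:
import pandas as pd

df = pd.read_csv("input.csv", sep=" ")

with open("output.csv", "w", newline="") as f_output:
    # sort=False keeps the groups in the order they appear in the file
    for key, group in df.groupby("H1", sort=False):
        f_output.write(str(key) + "\n")
        group.drop(columns="H1").to_csv(f_output, index=False, header=False)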

Related

How do I subset columns in a Pandas dataframe based on criteria using a loop?

I have a Pandas dataframe called 'bag' with columns called beans1, beans2, and beans3:
bag = pd.DataFrame({'beans1': [3,1,2,5,6,7], 'beans2': [2,2,1,1,5,6], 'beans3': [1,1,1,3,3,2]})
bag
Out[50]:
beans1 beans2 beans3
0 3 2 1
1 1 2 1
2 2 1 1
3 5 1 3
4 6 5 3
5 7 6 2
I want to use a loop to subset each column with observations greater than 1, so that I get:
beans1
0 3
2 2
3 5
4 6
5 7
beans2
0 2
1 2
4 5
5 6
beans3
3 3
4 3
5 2
The way to do it manually is:
beans1 = bag.loc[bag['beans1']>1, ['beans1']]
beans2 = bag.loc[bag['beans2']>1, ['beans2']]
beans3 = bag.loc[bag['beans3']>1, ['beans3']]
But I need to employ a loop, with something like:
for i in range(1,4):
    beans+str(i).loc[beans.loc[bag['beans'+i]>1,['beans'+str(i)]]
But it didn't work. I need a Python version of R's eval(parse(text="")).
Any help appreciated. Thanks much!
It is possible, but not recommended, with globals:
for i in range(1,4):
    globals()['beans' + str(i)] = bag.loc[bag['beans'+str(i)]>1, ['beans'+str(i)]]

# or, looping over the column names directly:
for c in bag.columns:
    globals()[c] = bag.loc[bag[c]>1, [c]]
print (beans1)
beans1
0 3
2 2
3 5
4 6
5 7
Better is to create a dictionary:
d = {c: bag.loc[bag[c]>1, [c]] for c in bag}
print (d['beans1'])
beans1
0 3
2 2
3 5
4 6
5 7
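If you specifically want the range-based loop from your question, the same dictionary can be filled explicitly (just the loop form of the comprehension above):
d = {}
for i in range(1, 4):
    col = 'beans' + str(i)                    # builds 'beans1', 'beans2', 'beans3'
    d[col] = bag.loc[bag[col] > 1, [col]]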

I want to add a new column on the basis of another column's data in pandas

I have multiple csv files which I merged together. After that, in order to identify the individual csv data in the merged file, I wish to create a new column in pandas; the new column should be called Serial.
I want the new Serial column to be numbered on the basis of the data in the Sequence Number column (for example 1,1,1,1,1, then 2,2, then 3,3,3,3 for every new csv in the file). A snapshot of the csv data is shown below.
Sequence Number
1
2
3
4
5
1
2
1
2
3
4
I want output like this:
Serial Sequence Number
1 1
1 2
1 3
1 4
1 5
2 1
2 2
3 1
3 2
3 3
3 4
Use DataFrame.insert to add the column in the first position, filled with a boolean mask comparing to 1 with Series.eq (==) followed by a cumulative sum with Series.cumsum:
df.insert(0, 'Serial', df['Sequence Number'].eq(1).cumsum())
print (df)
Serial Sequence Number
0 1 1
1 1 2
2 1 3
3 1 4
4 1 5
5 2 1
6 2 2
7 3 1
8 3 2
9 3 3
10 3 4
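Note that this relies on each new csv starting its sequence at 1. If that ever isn't guaranteed (an assumption, not stated in the question), a hedged alternative is to start a new serial whenever the sequence value drops:
# start a new group wherever the sequence value decreases
df.insert(0, 'Serial', df['Sequence Number'].diff().lt(0).cumsum().add(1))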

How to concat two Python DataFrames so that if a row already exists it isn't added, and if not, it is appended

I'm pretty new to python.
I am trying to concat two dataframes (df1, df2) so that if a row already exists in df1 then it is not added; if not, it is added to df1.
I don't want to use .concat().drop_duplicates() because I don't want duplicates within the same DataFrame to be removed.
BackStory:
I have multiple csv files that are exported from a piece of software in different locations. Once in a while I want to merge these into one file. The problem is that the exported files will have the same data as before, along with the new records made within that period of time. Therefore I need to check whether a record is already in there, as I will be executing the same code each time I export the data.
For the sake of example:
import pandas as pd
main_df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[2,5,4,5],[9,8,7,6],[8,5,6,7]])
df1 = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[1,5,4,8],[7,3,5,7],[4,3,8,5],[4,3,8,5]])
main_df
0 1 2 3
0 1 2 3 4 --duplicates I want to include--
1 1 2 3 4 --duplicates I want to include--
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
df1
0 1 2 3
0 1 2 3 4 --duplicates I want to exclude--
1 1 2 3 4 --duplicates I want to exclude--
2 4 2 5 1 --duplicates I want to exclude--
3 2 4 1 5 --duplicates I want to exclude--
4 1 5 4 8
5 7 3 5 7
6 4 3 8 5 --duplicates I want to include--
7 4 3 8 5 --duplicates I want to include--
I need the end result to be
main_df (after code execution)
0 1 2 3
0 1 2 3 4
1 1 2 3 4
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
7 1 5 4 8
8 7 3 5 7
9 4 3 8 5
10 4 3 8 5
I hope I have explained my issue in a clear way. Thank you
Check for every row in df1 whether it exists in main_df using pandas apply, and turn that into a mask by negating it with the ~ operator. I like using functools partial to make explicit that we are comparing to main_df.
import pandas as pd
from functools import partial
main_df = pd.DataFrame([
    [1,2,3,4],
    [1,2,3,4],
    [4,2,5,1],
    [2,4,1,5],
    [2,5,4,5],
    [9,8,7,6],
    [8,5,6,7]
])
df1 = pd.DataFrame([
    [1,2,3,4],
    [1,2,3,4],
    [4,2,5,1],
    [2,4,1,5],
    [1,5,4,8],
    [7,3,5,7],
    [4,3,8,5],
    [4,3,8,5]
])

def has_row(df, row):
    return (df == row).all(axis = 1).any()
main_df_has_row = partial(has_row, main_df)
duplicate_rows = df1.apply(main_df_has_row, axis = 1)
df1_add = df1.loc[~duplicate_rows]
pd.concat([main_df, df1_add])
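To keep the result as main_df and get the clean 0-10 index shown in the expected output, assign the concatenation back with ignore_index, for example:
main_df = pd.concat([main_df, df1_add], ignore_index=True)
print(main_df)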

Pandas jumbling up data after selecting columns

I have a (2.3m x 33) size dataframe. As I always do when selecting columns to keep, I use
colsToKeep = ['A','B','C','D','E','F','G','H','I']
df = df[colsToKeep]
However, this time the data under these columns becomes completely jumbled up on running the code. Entries for column A might end up under column D, for example. Totally at random.
Has anybody experienced this kind of behavior before? There is nothing out of the ordinary about the data and the df is totally fine before running these lines. Code run before problem begins:
import pandas as pd

with open('file.dat','r') as f:
    df = pd.DataFrame(l.rstrip().split() for l in f)

# rename columns with the first row
df.columns = df.iloc[0]

# drop first row which is now duplicated
df = df.iloc[1:]

# 33 nan columns - remove all the nan columns that appeared
df = df.loc[:, df.columns.notnull()]

colsToKeep = ['A','B','C','D','E','F','G','H','I']
df = df[colsToKeep]
Data suddenly goes from being nicely formatted such as:
A B C D E F G H I
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
to something more random like:
A B C D E F G H I
7 9 3 4 5 1 2 8 6
3 2 9 2 1 6 7 8 4
2 1 3 6 5 4 7 9 8

How to remove columns from a df, if columns are not in df2

I have multiple dfs I need to compare; however, because of the way the data was gathered, one df has 25 columns and another has 20. Keep in mind the column label names are the same (the 20 columns all exist in the 25-column df).
I can't figure out how to remove columns from df_cont if they don't exist in df_red, and also not include columns from df_red which are not currently in df_cont.
df_cont      A  B  C  D  E  F
01-01-2019   1  2  3  4  5  5
02-01-2019   1  3  4  4  6  5

df_red       A  B  D  F  G
01-01-2019   2  5  6  4  3
02-01-2019   2  5  6  4  3
Code:
df_cont1 = df_cont.query(df_cont.columns == df_red.columns)
Expected:
df_cont1     A  B  D  F
01-01-2019   1  2  4  5
02-01-2019   1  3  4  5
As @busybear already stated, you can use
df_cont = df_cont[df_red.columns]
in your special case.
This alternative solution is a bit safer if you don't know which DataFrame is the bigger one:
df_cont[df_cont.columns.intersection(df_red.columns)]
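With the sample frames above, either line leaves just the shared columns, for example:
common = df_cont.columns.intersection(df_red.columns)   # A, B, D, F
df_cont1 = df_cont[common]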
