How to take data from a specific column and separate it? - python

I have a question about CSV. I have CSV data like this:
Runway | Data1 | Data2 | Data3 |
13 | 425 | 23 | Go Straight |
| 222 | 24 | Go Straight |
| 424 | 25 | Go Left |
---------------------------------------
16 | 555 | 13 | Go Right |
| 858 | 14 | Go Right |
| 665 | 15 | Go Straight |
How can I turn it into separate runways so that it looks like this:
Runway | Data1 | Data2 | Data3 | S | Runway | Data1 | Data2 | Data3 |
13 | 425 | 23 | Go Straight | P | 16 | 555 | 13 | Go Right |
| 222 | 24 | Go Straight | A | | 858 | 14 | Go Right |
| 424 | 25 | Go Left | C | |
| | | | E |
Is this possible to do? Thank You

Sorry for taking so long; here is my code. I tried to keep it simple.
import csv

with open("file_name.csv", "r") as file:
    csv_reader = csv.reader(file, delimiter=',')
    csv_reader = list(csv_reader)

# Get table header and pop it
header = csv_reader.pop(0)

# Recursive function to extract each large row
def getLargeRow(rows, csvRows=[]):
    if len(rows) != 0:
        largeRow = [rows.pop(0)]
        while rows != [] and rows[0][0] == '':
            largeRow.append(rows.pop(0))
        csvRows.append(largeRow)
        return getLargeRow(rows, csvRows)
    else:
        return csvRows

# Now we have all large rows as a list of lists
rows = getLargeRow(csv_reader)

# Assuming that all large rows have the same height (same number of regular rows)
largeRowsHeight = len(rows[0])

# Amount of large rows
largeRowsAmount = len(rows)

print(rows)

# The text of the new csv file
newCsvFileText = ''
for i in range(largeRowsAmount):
    newCsvFileText += ','.join(header)
    if i < largeRowsAmount - 1:
        newCsvFileText += ',,'
newCsvFileText += '\n'
for i in range(largeRowsHeight):
    for j, row in enumerate(rows):
        newCsvFileText += ','.join(row[i])
        if j < len(rows) - 1:
            newCsvFileText += ',,'
    newCsvFileText += '\n'

# Save into a new file
with open("new_file.csv", "w") as newfile:
    newfile.write(newCsvFileText)
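For comparison, here is a rough pandas sketch of the same idea. It assumes the real CSV is genuinely comma-separated with the header Runway,Data1,Data2,Data3, that continuation rows have an empty Runway cell, and that there is no dashed separator line in the file:
import pandas as pd

# Read everything as strings; empty cells stay as "" instead of NaN
df = pd.read_csv("file_name.csv", dtype=str, keep_default_na=False)

# A new block starts whenever the Runway cell is non-empty
blocks = [g.reset_index(drop=True)
          for _, g in df.groupby((df["Runway"] != "").cumsum())]

# Put the blocks side by side with an empty spacer column between them
spaced = []
for i, g in enumerate(blocks):
    if i:
        spaced.append(pd.DataFrame({" ": [""] * len(g)}))
    spaced.append(g)

pd.concat(spaced, axis=1).to_csv("new_file.csv", index=False)
Shorter blocks are padded with empty cells by concat, so runway groups of unequal height are handled automatically.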


Issues appending to a list in python

I have the following data file.
| --- |
| Adelaide |
| --- |
| 2021 |
| --- |
| Rnd | T | Opponent | Scoring | F | Scoring | A | R | M | W-D-L | Venue | Crowd | Date |
| R1 | H | Geelong | 4.4 11.7 13.9 15.13 | 103 | 2.3 5.5 10.8 13.13 | 91 | W | 12 | 1-0-0 | Adelaide Oval | 26985 | Sat 20-Mar-2021 4:05 PM |
| R2 | A | Sydney | 3.2 4.6 6.14 11.22 | 88 | 4.1 9.6 15.11 18.13 | 121 | L | -33 | 1-0-1 | S.C.G. | 23946 | Sat 27-Mar-2021 1:45 PM |
I wrote code to manipulate that data into my desired result, which is a list. When I print my variable row at its current spot, it prints correctly.
However, when I append my list row to another list my_array, I have issues: I get an empty list back.
I think the issue is the placement of where I am appending?
My code is this:
with open('adelaide.md', 'r') as f:
    my_array = []
    team = ''
    year = ''
    for line in f:
        row = []
        line = line.strip()
        fields = line.split('|')
        num_fields = len(fields)
        if len(fields) == 3:
            val = fields[1].strip()
            if val.isnumeric():
                year = val
            elif val != '---':
                team = val
        elif num_fields == 15:
            row.append(team)
            row.append(year)
            for i in range(1, 14):
                row.append(fields[i].strip())
            print(row)
    my_array.append(row)
You need to append the row array inside the for loop.
I think the last line should be inside the for loop. Your code is probably appending only the last 'row' list. Just give it a tab.
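A minimal sketch of the corrected loop, with the append moved inside the elif branch (same logic as the original otherwise):
with open('adelaide.md', 'r') as f:
    my_array = []
    team = ''
    year = ''
    for line in f:
        row = []
        fields = line.strip().split('|')
        num_fields = len(fields)
        if num_fields == 3:
            val = fields[1].strip()
            if val.isnumeric():
                year = val
            elif val != '---':
                team = val
        elif num_fields == 15:
            row.append(team)
            row.append(year)
            for i in range(1, 14):
                row.append(fields[i].strip())
            my_array.append(row)  # runs once per data row, while still inside the loop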

Reducing memory footprint of a pandas groupby function

I have a huge dataset with contents such as given below:
+------+------------------------------------------------------------------+----------------------------------+--+
| HHID | VAL_CD64 | VAL_CD32 | |
+------+------------------------------------------------------------------+----------------------------------+--+
| 203 | 8c5bfd9b6755ffcdb85dc52a701120e0876640b69b2df0a314dc9e7c2f8f58a5 | 373aeda34c0b4ab91a02ecf55af58e15 | |
| 203 | 0511dc19cb09f8f4ba3d140754dafb1471dacdbb6747cdb5a2bc38e278d229c8 | 6f3606577eadacef1b956307558a1efd | |
| 203 | a18adc1bcae1b570a610b13565b82e5647f05fef8a4680bd6ccdd717cdd34af7 | 332321ab150879e930869c15b1d10c83 | |
| 720 | f6c581becbac4ec1291dc4b9ce566334b1cb2c85e234e489e7fd5e1393bd8751 | 2c4f97a04f02db5a36a85f48dab39b5b | |
| 720 | abad845107a699f5f99575f8ed43e0440d87a8fc7229c1a1db67793561f0f1c3 | 2111293e946703652070968b224875c9 | |
| 348 | 25c7cf022e6651394fa5876814a05b8e593d8c7f29846117b8718c3dd951e496 | 5c80a555fcda02d028fc60afa29c4a40 | |
| 348 | 67d9c0a4bb98900809bcfab1f50bef72b30886a7b48ff0e9eccf951ef06542f9 | 6c10cd11b805fa57d2ca36df91654576 | |
| 348 | 05f1e412e7765c4b54a9acfd70741af545564f6fdfe48b073bfd3114640f5e37 | 6040b29107adf1a41c4f5964e0ff6dcb | |
| 403 | 3e8da3d63c51434bcd368d6829c7cee490170afc32b5137be8e93e7d02315636 | 71a91c4768bd314f3c9dc74e9c7937e8 | |
+------+------------------------------------------------------------------+----------------------------------+--+
I'm processing the file to produce output in the format given below:
+------+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+----------------------------------+----------------------------------+----------------------------------+--+
| HHID | VAL1_CD64 | VAL2_CD64 | VAL3_CD64 | VAL1_CD32 | VAL2_CD32 | VAL3_CD32 | |
+------+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+----------------------------------+----------------------------------+----------------------------------+--+
| 203 | 8c5bfd9b6755ffcdb85dc52a701120e0876640b69b2df0a314dc9e7c2f8f58a5 | 0511dc19cb09f8f4ba3d140754dafb1471dacdbb6747cdb5a2bc38e278d229c8 | a18adc1bcae1b570a610b13565b82e5647f05fef8a4680bd6ccdd717cdd34af7 | 373aeda34c0b4ab91a02ecf55af58e15 | 6f3606577eadacef1b956307558a1efd | 332321ab150879e930869c15b1d10c83 | |
| 720 | f6c581becbac4ec1291dc4b9ce566334b1cb2c85e234e489e7fd5e1393bd8751 | abad845107a699f5f99575f8ed43e0440d87a8fc7229c1a1db67793561f0f1c3 | | 2c4f97a04f02db5a36a85f48dab39b5b | 2111293e946703652070968b224875c9 | | |
| 348 | 25c7cf022e6651394fa5876814a05b8e593d8c7f29846117b8718c3dd951e496 | 67d9c0a4bb98900809bcfab1f50bef72b30886a7b48ff0e9eccf951ef06542f9 | 05f1e412e7765c4b54a9acfd70741af545564f6fdfe48b073bfd3114640f5e37 | 5c80a555fcda02d028fc60afa29c4a40 | 6c10cd11b805fa57d2ca36df91654576 | 6040b29107adf1a41c4f5964e0ff6dcb | |
| 403 | 3e8da3d63c51434bcd368d6829c7cee490170afc32b5137be8e93e7d02315636 | | | 71a91c4768bd314f3c9dc74e9c7937e8 | | | |
+------+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------+----------------------------------+----------------------------------+----------------------------------+--+
My current code is:
import pandas as pd
import numpy as np
import os
import shutil
import glob
import time

start = time.time()
print('\nFile Processing Started\n')

path = r'C:\Users\xyz\Sample Data'
input_file = r'C:\Users\xyz\Sample Data\test'
output_file = r'C:\Users\xyz\Sample Data\test_MOD'

chunk = pd.read_csv(input_file + '.psv', sep='|', chunksize=10000,
                    dtype={"HH_ID": "string", "VAL_CD64": "string", "VAL_CD32": "string"})
chunk_list = []
for c_no in chunk:
    chunk_list.append(c_no)

file_no = 1
rec_cnt = 0
for i in chunk_list:
    start2 = time.time()
    rec_cnt = rec_cnt + len(i)
    rec_cnt2 = 0
    rec_cnt2 = len(i)
    df = pd.DataFrame(i)
    df_ = df.groupby('HH_ID').agg({'VAL_CD64': list, 'VAL_CD32': list})
    data = []
    for col in df_.columns:
        d = pd.DataFrame(df_[col].values.tolist(), index=df_.index)
        d.columns = [f'{col}_{i}' for i in map(str, range(1, len(d.columns) + 1))]
        data.append(d)
    res = pd.concat(data, axis=1)
    # res.columns = ['MAID1_SHA256', 'MAID2_SHA256', 'MAID3_SHA256', 'MAID1_MD5', 'MAID2_MD5', 'MAID3_MD5']
    res.to_csv(output_file + str(file_no) + '.psv', index=True, sep='|')
    with open(output_file + str(file_no) + '.psv', 'r') as istr:
        with open(input_file + str(file_no) + '.psv', 'w') as ostr:
            for line in istr:
                line = line.strip('\n') + '|'
                print(line, file=ostr)
    os.remove(output_file + str(file_no) + '.psv')
    file_no += 1
    end2 = time.time()
    duration2 = end2 - start2
    print("\nProcessed " + str(rec_cnt2) + " records in " + str(round(duration2, 2)) +
          " seconds. \nTotal Processed Records: " + str(rec_cnt))

os.remove(input_file + '.psv')

allFiles = glob.glob(path + "/*.psv")
allFiles.sort()
with open(os.path.join(path, 'someoutputfile.csv'), 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()
            shutil.copyfileobj(infile, outfile)

test = os.listdir(path)
for item in test:
    if item.endswith(".psv"):
        os.remove(os.path.join(path, item))

final_file_name = input_file + '.psv'
os.rename(os.path.join(path, 'someoutputfile.csv'), final_file_name)

end = time.time()
duration = end - start
print("\n" + str(rec_cnt) + " records added in " + str(round(duration, 2)) + " seconds. \n")
However, this code takes a long time to process a 400-million-record file, approximately 18-19 hours, running on Unix. The whole script gets killed if I try to process a 700-million-record file. From my Google searching, I believe it is being killed due to the high memory usage of the groupby function.
Is there any way I can reduce the memory footprint of this program, so that a 700-million-record file can be processed with it?
I'm not sure how to do it with pandas, but you can do this without ever keeping more than a few rows in memory.
First, make sure the dataset is sorted by the column you want to group by. If not, sort them using an external merge sort algorithm.
Then, just follow this simple algorithm (sketched in code below):
read the first HHID, and start new lists of VAL_CD64 and VAL_CD32
while there are more lines:
    read the next line
    if the HHID is the same as the previous one, add VAL_CD64 and VAL_CD32 to the current lists
    else:
        write out the previous HHID and the accumulated values
        start collecting new lists for the new HHID
write out the last HHID and the accumulated values
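A minimal sketch of that streaming approach in plain Python, assuming the input is pipe-delimited, already sorted by HHID, uses the column names from the sample above, and has at most three values per HHID:
import csv

def stream_group(in_path, out_path, max_width=3):
    # Collapse consecutive rows sharing an HHID into one wide row.
    # Assumes the input file is already sorted by HHID.
    with open(in_path, newline='') as src, open(out_path, 'w', newline='') as dst:
        reader = csv.DictReader(src, delimiter='|')
        writer = csv.writer(dst, delimiter='|')
        writer.writerow(['HHID']
                        + [f'VAL{i}_CD64' for i in range(1, max_width + 1)]
                        + [f'VAL{i}_CD32' for i in range(1, max_width + 1)])

        def flush(hhid, cd64, cd32):
            # Pad both lists to max_width so every output row has the same shape
            writer.writerow([hhid]
                            + cd64 + [''] * (max_width - len(cd64))
                            + cd32 + [''] * (max_width - len(cd32)))

        prev, cd64, cd32 = None, [], []
        for row in reader:
            if row['HHID'] != prev:
                if prev is not None:
                    flush(prev, cd64, cd32)
                prev, cd64, cd32 = row['HHID'], [], []
            cd64.append(row['VAL_CD64'])
            cd32.append(row['VAL_CD32'])
        if prev is not None:
            flush(prev, cd64, cd32)
Because only the current HHID's values are kept in memory, file size no longer matters; if the file is not already sorted, an external sort (for example the Unix sort utility) can be run first.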

How to aggregate and restructure dataframe data in pyspark (column wise)

I am trying to aggregate data in a pyspark dataframe based on a particular criterion. I am trying to align the accounts based on the switchOUT amount to the switchIN amount, so that the account money switches out of becomes the from_acct and the other accounts become the to_acct.
The data I am getting in the dataframe to begin with:
+--------+------+-----------+----------+----------+-----------+
| person | acct | close_amt | open_amt | switchIN | switchOUT |
+--------+------+-----------+----------+----------+-----------+
| A | 1 | 125 | 50 | 75 | 0 |
+--------+------+-----------+----------+----------+-----------+
| A | 2 | 100 | 75 | 25 | 0 |
+--------+------+-----------+----------+----------+-----------+
| A | 3 | 200 | 300 | 0 | 100 |
+--------+------+-----------+----------+----------+-----------+
To this table
+--------+--------+-----------+----------+----------+
| person | from_acct| to_acct | switchIN | switchOUT|
+--------+----------+--------+----------+-----------+
| A | 3 | 1 | 75 | 100 |
+--------+----------+--------+----------+-----------+
| A | 3 | 2 | 25 | 100 |
+--------+----------+--------+----------+-----------+
Also, how can I do it so that it works for N rows (not just 3 accounts)?
So far I have used this code
import operator
from pyspark.sql import functions as F

# define udfs
def sorter(l):
    res = sorted(l, key=operator.itemgetter(1))
    return [item[0] for item in res]

def list_to_string(l):
    res = 'from_fund_' + str(l[0]) + '_to_fund_' + str(l[1])
    return res

def listfirstAcc(l):
    res = str(l[0])
    return res

def listSecAcc(l):
    res = str(l[1])
    return res

sort_udf = F.udf(sorter)
list_str = F.udf(list_to_string)
extractFirstFund = F.udf(listfirstAcc)
extractSecondFund = F.udf(listSecAcc)

# Add additional columns
df = df.withColumn("move", sort_udf("list_col").alias("sorted_list"))
df = df.withColumn("move_string", list_str("move"))
df = df.withColumn("From_Acct", extractFirstFund("move"))
df = df.withColumn("To_Acct", extractSecondFund("move"))
Current outcome I am getting:
+--------+--------+-----------+----------+----------+
| person | from_acct| to_acct | switchIN | switchOUT|
+--------+----------+--------+----------+-----------+
| A | 3 | 1,2 | 75 | 100 |
+--------+----------+--------+----------+-----------+
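One possible way to get there (a rough sketch, not tested against the real data): split the rows into switch-out and switch-in legs and join them back per person. It assumes the original dataframe is named df and that each person has a single switch-out account:
from pyspark.sql import functions as F

# Accounts money moves out of become from_acct; accounts money moves into become to_acct
out_legs = (df.filter(F.col("switchOUT") > 0)
              .select("person", F.col("acct").alias("from_acct"), "switchOUT"))
in_legs = (df.filter(F.col("switchIN") > 0)
             .select("person", F.col("acct").alias("to_acct"), "switchIN"))

result = (out_legs.join(in_legs, on="person")
                  .select("person", "from_acct", "to_acct", "switchIN", "switchOUT"))
result.show()
The join pairs rows per person, so it scales to N accounts without any udfs; if a person can have several switch-out accounts, the pairing rule would need to be made explicit first.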

Best way to compare 2 dfs, get the name of different col & before + after vals?

What is the best way to compare 2 dataframes with the same column names, row by row, and if a cell is different, get the before and after values and which cell is different in that dataframe?
I know this question has been asked a lot, but none of the existing answers fit my use case. Speed is important. There is a package called datacompy, but it is not good if I have to compare 5,000 dataframes in a loop (I'm only comparing 2 at a time, but around 10,000 dataframes total, compared 5,000 times).
I don't want to join the dataframes on a column. I want to compare them row by row: row 1 with row 1, etc. If a column in row 1 is different, I only need to know the column name, the before, and the after. Perhaps if it is numeric I could also add a column with the absolute value of the difference.
The problem is, there is sometimes an edge case where rows are out of order (only by 1 entry), and I don't want these to come up as false positives.
Example:
These dataframes would be created when I pass in race # (there are 5,000 race numbers)
df1
+-----+-------+--+------+--+----------+----------+-------------+--+
| Id | Speed | | Name | | Distance | | Location | |
+-----+-------+--+------+--+----------+----------+-------------+--+
| 181 | 10.3 | | Joe | | 2 | | New York | |
| 192 | 9.1 | | Rob | | 1 | | Chicago | |
| 910 | 1.0 | | Fred | | 5 | | Los Angeles | |
| 97 | 1.8 | | Bob | | 8 | | New York | |
| 88 | 1.2 | | Ken | | 7 | | Miami | |
| 99 | 1.1 | | Mark | | 6 | | Austin | |
+-----+-------+--+------+--+----------+----------+-------------+--+
df2:
+-----+-------+--+------+--+----------+----------+-------------+--+
| Id | Speed | | Name | | Distance | | Location | |
+-----+-------+--+------+--+----------+----------+-------------+--+
| 181 | 10.3 | | Joe | | 2 | | New York | |
| 192 | 9.4 | | Rob | | 1 | | Chicago | |
| 910 | 1.0 | | Fred | | 5 | | Los Angeles | |
| 97 | 1.5 | | Bob | | 8 | | New York | |
| 99 | 1.1 | | Mark | | 6 | | Austin | |
| 88 | 1.2 | | Ken | | 7 | | Miami | |
+-----+-------+--+------+--+----------+----------+-------------+--+
diff:
+-------+----------+--------+-------+
| Race# | Diff_col | Before | After |
+-------+----------+--------+-------+
| 123 | Speed | 9.1 | 9.4 |
| 123 | Speed | 1.8 | 1.5 |
An example of a false positive is with the last 2 rows, Ken + Mark.
I could summarize the differences in one line per race, but if the dataframe has 3,000 records and there are 1,000 differences (unlikely, but possible), then I would have tons of columns. I figured this way was easier, since I can export to Excel and then sort by race # to see all the differences, or by diff_col to see which columns are different.
def DiffCol2(df1, df2, race_num):
    is_diff = False
    diff_cols_list = []
    row_coords, col_coords = np.where(df1 != df2)
    diffDf = []
    alldiffDf = []
    for y in set(col_coords):
        col_df1 = df1.iloc[:, y].name
        col_df2 = df2.iloc[:, y].name
        for index, row in df1.iterrows():
            if df1.loc[index, col_df1] != df2.loc[index, col_df2]:
                col_name = col_df1
                if col_df1 != col_df2:
                    col_name = (col_df1, col_df2)
                diffDf.append({'Race #': race_num, 'Column Name': col_name,
                               'Before': df2.loc[index, col_df2], 'After': df1.loc[index, col_df1]})
                try:
                    check_edge_case = df1.loc[index, col_df1] == df2.loc[index + 1, col_df1]
                except:
                    check_edge_case = False
                try:
                    check_edge_case_two = df1.loc[index, col_df1] == df2.loc[index - 1, col_df1]
                except:
                    check_edge_case_two = False
                if not (check_edge_case or check_edge_case_two):
                    col_name = col_df1
                    if col_df1 != col_df2:
                        # if for some reason the column names aren't the same, which should
                        # never happen, but just in case, I want to know both column names
                        col_name = (col_df1, col_df2)
                    is_diff = True
                    diffDf.append({'Race #': race_num, 'Column Name': col_name,
                                   'Before': df2.loc[index, col_df2], 'After': df1.loc[index, col_df1]})
    return diffDf, alldiffDf, is_diff
[Apologies in advance for the weirdly formatted tables; I did my best given how annoying pasting tables into S/O is.]
The code below works if the dataframes have the same number and names of columns and the same number of rows, so it compares only the values in the tables.
Not sure where you want to get Race# from.
df1 = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df2 = df1.copy(deep=True)
df2['B'][5] = 100  # Creating difference
df2['C'][6] = 100  # Creating difference

dif = []
for col in df1.columns:
    for bef, aft in zip(df1[col], df2[col]):
        if bef != aft:
            dif.append([col, bef, aft])
print(dif)
The result is a list of [column, before, after] entries, one per changed cell.
Alternative solution without loops
df = df1.melt()
df.columns=['Column', 'Before']
df.insert(2, 'After', df2.melt().value)
df[df.Before!=df.After]
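For what it's worth, if you are on pandas 1.1 or newer, DataFrame.compare gives a similar row-aligned diff out of the box (it does not handle the out-of-order edge case described in the question); a minimal sketch:
# df1 and df2 must have identical shape and labels
diff = df1.compare(df2)  # MultiIndex columns: (column, 'self'/'other')
diff = diff.rename(columns={'self': 'Before', 'other': 'After'}, level=1)
print(diff)
Here 'self' holds df1's values and 'other' holds df2's, matching the before/after convention of the loop above.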

How to extract a particular set of values from a file in Python?

I am stuck on the logic here... I have to extract some values from a text file that looks like this:
AAA
+-------------+------------------+
| ID | count |
+-------------+------------------+
| 3 | 1445 |
| 4 | 105 |
| 9 | 160 |
| 10 | 30 |
+-------------+------------------+
BBB
+-------------+------------------+
| ID | count |
+-------------+------------------+
| 3 | 1445 |
| 4 | 105 |
| 9 | 160 |
| 10 | 30 |
+-------------+------------------+
CCC
+-------------+------------------+
| ID | count |
+-------------+------------------+
| 3 | 1445 |
| 4 | 105 |
| 9 | 160 |
| 10 | 30 |
+-------------+------------------+
I am not able to extract the values from BBB alone and append them to a list, like this:
f = open(sys.argv[1], "r")
text = f.readlines()
B_Values = []
for i in text:
    if i.startswith("BBB"):  # (Example)
        B_Values.append("only values of BBB")
    if i.startswith("CCC"):
        break
print B_Values
which should result in:
['| 3 | 1445 |','| 4 | 105 |','| 9 | 160 |','| 10 | 30 |']
d = {}
with open(sys.argv[1]) as f:
    for line in f:
        if line[0].isalpha():  # is the first character in the line a letter?
            curr = d.setdefault(line.strip(), [])
        elif filter(str.isdigit, line):  # is there any digit in the line?
            curr.append(line.strip())
for this file, d is now:
{'AAA': ['| 3 | 1445 |',
'| 4 | 105 |',
'| 9 | 160 |',
'| 10 | 30 |'],
'BBB': ['| 3 | 1445 |',
'| 4 | 105 |',
'| 9 | 160 |',
'| 10 | 30 |'],
'CCC': ['| 3 | 1445 |',
'| 4 | 105 |',
'| 9 | 160 |',
'| 10 | 30 |']}
Your B_values are d['BBB']
You can use a state flag bstarted to track when the B-group has begun.
After scanning the B-Group, delete the three header rows and the one footer row.
B_Values = []
bstarted = False
for i in text:
    if i.startswith("BBB"):
        bstarted = True
    elif i.startswith("CCC"):
        bstarted = False
        break
    elif bstarted:
        B_Values.append(i)
del B_Values[:3]   # get rid of the header
del B_Values[-1]   # get rid of the footer
print B_Values
You should avoid iterating over the already read lines. Call readline whenever you want to read the next line and check to see what it is:
f = open(sys.argv[1], "r")
B_Values = []
i = " "
while i != "":
    i = f.readline()
    if i.startswith("BBB"):  # (Example)
        for temp in range(3):
            f.readline()  # Skip the 3 lines of table headers
        i = f.readline()
        while i.strip() != "+-------------+------------------+" and i != "":
            # While we've not reached the table footer
            B_Values.append(i.strip())
            i = f.readline()
        break
# Although not necessary, you'd better put a close function there, too.
f.close()
print B_Values
EDIT: eumiro's method is more flexible than mine, since it reads the values from all sections. You could add an isalpha test to my example to read all the values too, but his method is still easier to read.
