I have a rather large text file with multiple columns that I must convert to a 15-column .csv file to be read in Excel. The logic for parsing the fields I need is written out below, but I am having trouble writing it to .csv.
columns = ['TRANSACTN_NBR', 'RECORD_NBR',
           'SEQUENCE_OR_PIC_NBR', 'CR_DB', 'RT_NBR', 'ACCOUNT_NBR',
           'RSN_COD', 'ITEM_AMOUNT', 'ITEM_SERIAL', 'CHN_IND',
           'REASON_DESCR', 'SEQ2', 'ARCHIVE_DATE', 'ARCHIVE_TIME', 'ON_US_IND']

for line in in_file:
    values = line.split()
    if 'PRINT DATE:' in line:
        dtevalue = line.split(a, 1)[-1].split(b)[0]
        lines.append(dtevalue)
    elif 'PRINT TIME:' in line:
        timevalue = line.split(c, 1)[-1].split(b)[0]
        lines.append(timevalue)
    elif (len(values) >= 4 and values[3] == 'C'
          and len(values[2]) >= 2 and values[2][:2] == '41'):
        print(values)
    elif (len(values) >= 5 and values[3] == 'D'
          and values[4] in rtnbr):
        on_us = '1'
    else:
        on_us = '0'

print(lines[0])
print(lines[1])
I originally tried the csv module, but the parsed rows were written to only 12 columns, and I could not find a way to append the date and time (parsed separately) to the end of each row.
I also looked at the pandas package, but I have only seen ways to extract patterns, which wouldn't work with the parsing criteria established above.
Is there a way to write to csv using the above criteria? Or do I have to scrap it and rewrite the code within a specific package?
Any help is appreciated.
EDIT: Text file sample:
* START ******************************************************************************************************************** START *
* START ******************************************************************************************************************** START *
* START ******************************************************************************************************************** START *
1--------------------
1ANTECR09 CHEK DPCK_R_009
TRANSIT EXTRACT SUB-SYSTEM
CURRENT DATE = 08/03/2017 JOURNAL REPORT PAGE 1
PROCESS DATE =
ID = 022000046-MNT
FILE HEADER = H080320171115
+____________________________________________________________________________________________________________________________________
R T SEQUENCE CR BT A RSN ITEM ITEM CHN USER REASO
NBR NBR OR PIC NBR DB NBR NBR COD AMOUNT SERIAL IND .......FIELD.. DESCR
5,556 01 7450282689 C 538196640 9835177743 15 $9,064.81 00 CREDIT
5,557 01 7450282690 D 031301422 362313705 38 $592.35 43431 DR CR
5,558 01 7450282691 D 021309379 601298839 38 $1,491.04 44896 DR CR
5,559 01 7450282692 D 071108834 176885 38 $6,688.00 1454 DR CR
5,560 01 7450282693 D 031309123 1390001566241 38 $293.42 6878 DR CR
--------------------
34,615 207 4100223726 C 538196620 9866597322 10 $645.49 00 CREDIT
34,616 207 4100223727 D 022000046 8891636675 31 $645.49 111583 DR ON-
--------------------
34,617 208 4100223728 C 538196620 11701364 10 $756.19 00 CREDIT
34,618 208 4100223729 D 071923828 00 54 $305.31 11384597 BAD AC
34,619 208 4100223730 D 071923828 35110011 30 $450.88 10913052 6 DR SEL
--------------------
Desired output (looking only at lines whose sequence starts with 41 and whose CR/DB field is C, with the parsed date and time appended):
1293 83834 4100225908 C 538196620 9860890913 10 161.5 0 CREDIT 41 3-Aug-17 11:15:51
1294 83838 4100225911 C 538196620 25715845 10 138 0 CREDIT 41 3-Aug-17 11:15:51
Look at the pandas package, more specifically the DataFrame class. With a little cleverness you ought to be able to read your table using pandas.read_table(), which returns a DataFrame that you can write out with to_csv(): effectively a two-line solution. You'll need to check the docs for the parameters needed to read your table format properly, but it should be a little easier than doing it manually.
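For example, a minimal sketch of that two-step approach. Note that report.txt, the skiprows value, and the whitespace delimiter are all assumptions here and would need tuning against the real report layout:

import pandas as pd

# Sketch only: assumes the report's header block can be skipped with
# skiprows and the detail rows split cleanly on runs of whitespace.
df = pd.read_table('report.txt', sep=r'\s+', skiprows=11, header=None,
                   names=columns, engine='python')  # columns: the 15-name list above
df.to_csv('output.csv', index=False)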
This is my data frame:
Name Age Stream Percentage
0 A 21 Math 88
1 B 19 Commerce 92
2 C 20 Arts 95
3 D 18 Biology 70
0 E 21 Math 88
1 F 19 Commerce 92
2 G 20 Arts 95
3 H 18 Biology 70
I want to write out a different Excel file for each subject in one loop, so basically I should get 4 Excel files, one per subject.
I tried this, but it didn't work:
n = 0
for subjects in df.stream:
    df.to_excel("sub" + str(n) + ".xlsx")
    n += 1
I think groupby is helpful here, and you can use enumerate to keep track of the index.
for i, (group, group_df) in enumerate(df.groupby('Stream')):
    group_df.to_excel('sub{}.xlsx'.format(i))
    # Alternatively, to name the file based on the stream itself:
    # group_df.to_excel('sub{}.xlsx'.format(group))
group is going to be the name of the stream.
group_df is going to be a sub-dataframe containing all the data in that group.
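As a self-contained sketch, reconstructing the sample frame shown above:

import pandas as pd

df = pd.DataFrame({
    'Name': list('ABCDEFGH'),
    'Age': [21, 19, 20, 18] * 2,
    'Stream': ['Math', 'Commerce', 'Arts', 'Biology'] * 2,
    'Percentage': [88, 92, 95, 70] * 2,
})

# One file per stream, named after the stream itself
# (writing .xlsx requires the openpyxl package).
for group, group_df in df.groupby('Stream'):
    group_df.to_excel('sub{}.xlsx'.format(group), index=False)
# -> subArts.xlsx, subBiology.xlsx, subCommerce.xlsx, subMath.xlsx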
I have a .txt file with data on the total number of queries with valid names. The text in the file came from a SQL Server 2019 query output. The database used consists of the results of an algorithm that retrieves the brands most similar to the query inserted. The file looks something like this:
2 16, 42, 44 A MINHA SAÚDE
3 34 !D D DUNHILL
4 33 #MEGA
5 09 (michelin man)
5 12 (michelin man)
6 33 *MONTE DA PEDRA*
7 35 .FOX
8 33 #BATISTA'S BY PITADA VERDE
9 12 #COM
10 41 + NATUREZA HUMANA
11 12 001
12 12 002
13 12 1007
14 12 101
15 12 102
16 12 104
17 37 112 PC
18 33 1128
19 41 123 PILATES
The 1st column has the query identifier, the 2nd one has the brand classes where the query can be located, and the 3rd one is the query itself (the spaces come from the SQL Server output formatting).
I then made a pandas DataFrame in Google Colaboratory, where I wanted the columns to be like the ones in the text file. However, when I ran the code, the result did not split into those columns correctly.
The code that I wrote is here:
# Dataframe with the total number of queries with valid names:
df = pd.DataFrame(pd.read_table(
    "/content/drive/MyDrive/data/classes/100/queries100.txt",
    header=None,
    names=["Query ID", "Query Name", "Classes Where Query is Present"]))
df
I think this happens because of the commas in the 2nd column, but I'm not quite sure. Any suggestions on why this is happening? I already tried read_csv and read_fwf, and they were even worse in terms of formatting.
You can use pd.read_fwf() in this case, as your columns have fixed widths. Note that in your file the 2nd column is the classes and the 3rd is the query name, so the names should be given in that order:

import pandas as pd

df = pd.read_fwf(
    "/content/drive/MyDrive/data/classes/100/queries100.txt",
    colspecs=[(0, 20), (20, 40), (40, 1000)],
    header=None,
    names=["Query ID", "Classes Where Query is Present", "Query Name"]
)
df.head()
#    Query ID Classes Where Query is Present      Query Name
# 0         2                      16, 42, 44   A MINHA SAÚDE
# 1         3                              34    !D D DUNHILL
# 2         4                              33           #MEGA
# 3         5                              09  (michelin man)
# 4         5                              12  (michelin man)
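If the column boundaries are consistent enough, read_fwf can also infer them from the data instead of hard-coding colspecs (a sketch; infer_nrows controls how many rows are sampled for the guess):

df = pd.read_fwf(
    "/content/drive/MyDrive/data/classes/100/queries100.txt",
    header=None,
    infer_nrows=200,  # sample more rows if the widths vary near the top
    names=["Query ID", "Classes Where Query is Present", "Query Name"]
)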
I have a .txt file that has 6 columns:
1. sex (M/F)  2. age  3. height  4. weight  5. -/+  6. zip code
I need to find from this text how many males have the - sign (for example: from the txt, 30 M (males) are -).
So I need only the number at the end.
Logically I need to work with column 1 and column 5, but I am struggling to get only one (sum) number at the end.
This is the content of the text:
M 87 66 133 - 33634
M 17 77 119 - 33625
M 63 57 230 - 33603
F 55 50 249 - 33646
M 45 51 204 - 33675
M 58 49 145 - 33629
F 84 70 215 - 33606
M 50 69 184 - 33647
M 83 60 178 - 33611
M 42 66 262 - 33682
M 33 75 176 + 33634
M 27 48 132 - 33607
I am getting a result now, but I want to count only lines that match both conditions, M and the - sign. How can I add that to occurrences?
f = open('corona.txt', 'r')
data = f.read()
occurrences = data.count('M')
print('Number of Males that have been tested positive:', occurrences)
You can split the lines like this:

occurrences = 0
with open('corona.txt') as f:
    for line in f:
        cells = line.split()
        if cells[0] == "M" and cells[4] == "-":
            occurrences += 1

print("Occurrences of M-:", occurrences)
But it is better to use the csv module or pandas for this type of work.
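For instance, a sketch of the same count with the csv module, assuming single-space delimiters as in the sample above:

import csv

occurrences = 0
with open('corona.txt', newline='') as f:
    for row in csv.reader(f, delimiter=' '):
        # row[0] is the sex column, row[4] the -/+ column
        if row and row[0] == 'M' and row[4] == '-':
            occurrences += 1
print("Occurrences of M-:", occurrences)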
If you do any significant amount of work with text and columnar data, I would suggest getting started on learning pandas.
For this task, if your csv is one record per line and is space-delimited:
import pandas as pd

d = pd.read_csv('data.txt',
                names=['Sex', 'Age', 'Height', 'Weight', 'Sign', 'ZIP'],
                sep=' ', index_col=False)

d[(d.Sex == 'M') & (d.Sign == '-')].shape[0]  # or
len(d[(d.Sex == 'M') & (d.Sign == '-')])      # same result, in this case = 9
Pandas is a very extensive package. What this code does is build a DataFrame from your csv data, giving each column a name. It then selects each row where both of your conditions hold (Sex == 'M' and Sign == '-') and reports the number of records thus found.
I recommend starting with the official pandas tutorials.
With my Python code I'm looking for a cell with a specific table name, in this case 'Quality distribution'. In the Excel file there are two tables with this name, and I only want to work with the first one.
My code works correctly if there is only one cell with that table name, but now it finds the first cell with 'Quality distribution', keeps searching, finds a second one, and ends up indexing from the second table. How can I adjust my code so that I work with the first table only?
My Excel file contains 12 tables in columns A and B, and every table has 67 to 350 rows. A table name is stated above each table.
An example (I have deleted some tables and rows, since the full sheet has 2000 rows):
Summary
Creation date: Fri Aug 02 13:49:15 CEST 2019
Generated by: XXXX
Software: CLC Genomics Workbench 12.0
Based upon: 1 data set
XXXXXXX_S7_L001_R1_001 (paired): 5.102.482 sequences in pairs
Total sequences in data set 5.102.482 sequences
Total nucleotides in data set 558.462.117 nucleotides
Quality distribution
average PHRED score % sequences:
0 0
1 0
27 0.889841454
28 1.157475911
29 1.472773446
Per-base analysis
Coverage
base position % coverage:
0 100
1 100
2 100
147 37.30090572
148 36.1365508
149 33.95743483
150 24.3650639
151 0
Quality distribution
base position PHRED score: 5%ile PHRED score: 25%ile PHRED score: Median PHRED score: 75%ile PHRED score: 95%ile
0 0 0 0 0 0
1 18 32 32 33 34
2 18 32 33 33 34
3 18 32 33 34 34
146 15 37 38 39 39
147 15 37 38 39 39
148 15 37 38 39 39
149 15 37 38 39 39
150 15 36 38 39 39
151 13 33 37 38 39
#!/usr/bin/python3
import xlrd

kit = 'test_QC_150.xlsx'
wb = xlrd.open_workbook(kit)
sheet = wb.sheet_by_index(0)

def phred_score():
    for sheet in wb.sheets():
        for rowidx in range(sheet.nrows):
            row = sheet.row(rowidx)
            for colidx, cell in enumerate(row):
                # searching for the quality distribution
                if cell.value == "Quality distribution":
                    index_quality_distribution = rowidx
                    print('index_quality_distribution: ', index_quality_distribution)
                    index = index_quality_distribution + 35
                    index_end = index_quality_distribution + 67
                    print(index)
                    print(index_end)

def main():
    phred_score()

if __name__ == '__main__':
    main()
I think the answer is quite simple. Your code is not "wrong"; you just haven't thought it through to the end.
Your for-loop runs through all the cells in the range you specified, and previously only one cell satisfied the if statement that follows:
for colidx, cell in enumerate(row):
    # searching for the quality distribution
    if cell.value == "Quality distribution":
        index_quality_distribution = rowidx
Now that there are two instances, it will find both, but since you overwrite the index_quality_distribution variable, only the last one found is kept in memory. What you can do is stop searching as soon as the index is found the first time. Note that a bare break only exits the innermost loop, so the simplest way out of all the nested loops is to return from the function:

def phred_score():
    for sheet in wb.sheets():
        for rowidx in range(sheet.nrows):
            for cell in sheet.row(rowidx):
                # searching for the quality distribution
                if cell.value == "Quality distribution":
                    index_quality_distribution = rowidx
                    print('index_quality_distribution: ', index_quality_distribution)
                    return index_quality_distribution  # stop at the first match
    return None  # failsafe in case no "Quality distribution" is found
That should do it.
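Once the function returns the first table's row index, the offsets from your original code can be applied to slice out just that table. A sketch, reusing your +35/+67 offsets and the module-level sheet:

index_quality_distribution = phred_score()
if index_quality_distribution is not None:
    start = index_quality_distribution + 35
    end = index_quality_distribution + 67
    for rowidx in range(start, end):
        print(sheet.row_values(rowidx))  # one row of the first table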
I have a text file (one.txt) that contains an arbitrary number of key-value pairs, where the key and value are separated by an = sign (e.g. 1=88). Here are some examples:
1=88|11=1438|15=KKK|45=00|45=00|21=66|86=a
4=13|11=1438|49=DDD|8=157.73|67=00|45=00|84=b|86=a
6=84|41=18|56=TTT|67=00|4=13|45=00|07=d
I need to create a DataFrame with a list of dictionaries, each row being one dictionary in the list, like so:
[{1:88,11:1438,15:kkk,45:7.7....},{4:13,11:1438....},{6:84,41:18,56:TTT...}]
df = pd.read_csv("input.txt",names=['text'],header=None)
data = df['text'].str.split("|")
names=[ y.split('=') for x in data for y in x]
ds=pd.DataFrame(names)
print ds
How can I create a dictionary for each line by splitting on the = symbol?
Each line should become one row with multiple columns.
The DataFrame should have all the keys as columns and the values filling the rows.
Example:
1 11 15 45 21 86 4 49 8 67 84 6 41 56 45 07
88 1438 kkk 00 66 a
na 1438 na .....
I think performing a .pivot would work. Try this:
import pandas as pd

df = pd.read_csv("input.txt", names=['text'], header=None)
data = df['text'].str.split("|")
names = [y.split('=') for x in data for y in x]
ds = pd.DataFrame(names)
ds = ds.pivot(columns=0).fillna('')
The .fillna('') replaces the missing (NaN) values with empty strings. If you'd like to show na instead, you can use .fillna('na').
Output:
ds.head()
1
0 07 1 11 15 21 4 41 45 49 56 6 67 8 84 86
0 88
1 1438
2 KKK
3 00
4 00
To save space I didn't print the entire DataFrame, but it indexes the columns by key and then fills in the values from each line (preserving the dict-by-line concept).
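If you do want literally one dictionary per line (as in the list shown in the question), here is a sketch that skips the pivot and builds the dicts directly, letting the DataFrame constructor align the keys as columns:

import pandas as pd

rows = []
with open('one.txt') as f:
    for line in f:
        pairs = (item.split('=', 1) for item in line.strip().split('|') if item)
        # note: duplicate keys such as 45=00|45=00 keep only the last value
        rows.append(dict(pairs))

df = pd.DataFrame(rows)  # keys become columns; missing keys become NaN
print(df)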