Convert .txt file to .csv with specific columns PYTHON - python

I have some text file that I want to load into my python code, but the format of the txt file is not suitable.
Here is what it contains:
SEQ MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLASWNY
SS3 CCCHHHHHHHHHHHHCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
95024445656543114678678999999999999999888889998886
SS8 CCHHHHHHHHHHHHHHCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
96134445555554311253378999999999999999999999999987
SA EEEbBBBBBBBBBBbEbEEEeeEeBeEbBEEbbEeBeEbbeebBbBbBbb
41012123422000000103006262214011342311110000030001
TA bhHHHHHHHHHHHHHgIihiHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
00789889988663201010099999999999999999898999998741
CD NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
54433221111112221122124212411342243234323333333333
I want to convert it into panda Dataframe to have SEQ SS4 SA TA CD SS8 as columns of the DataFrame and the line next to them as the rows.
Like this:
I tried pd.read_csv but it doesn't give me the result I want.
Thank you !

To read a text file using pandas.read_csv() method, the text file should contain data separated with comma.
SEQ, SS3, ....
MSSSSWLLLSLVAVTAAQSTIEEQ..., CCCHHHHHHHHHHHHCCCCCCHHHHHHH.....

Steps
Use pd.read_fwf() to read files in a fixed-width format.
Fill the missing values with the last available value by df.ffill().
Assign group number gp for the row number in the output using a groupby-cumcount construct.
Move gp=(0,1) to columns by df.pivot, and then transpose again into the desired output.
Note: this solution works with arbitrary (includes zero, and of course not too many) consecutive lines with omitted values in the first column.
Code
# data (3 characters for the second column only)
file_path = "/mnt/ramdisk/input.txt"
df = pd.read_fwf(file_path, names=["col", "val"])
# fill the blank values
df["col"].ffill(inplace=True)
# get correct row location
df["gp"] = df.groupby("col").cumcount()
# pivot group (0,1) to columns and then transpose.
df_ans = df.pivot(index="col", columns="gp", values="val").transpose()
Result
print(df_ans) # show the first 3 characters only
col CD SA SEQ SS3 SS8 TA
gp
0 NNN EEE MSS CCC CCH bhH
1 544 410 NaN 950 961 007
Then you can save the resulting DataFrame using df_ans.to_csv().

You can use this script to load the .txt file to DataFrame and save it as csv file:
import pandas as pd
data = {}
with open('<your file.txt>', 'r') as f_in:
for line in f_in:
line = line.split()
if len(line) == 2:
data[line[0]] = [line[1]]
df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv', index=False)
Saves this CSV:

Related

In Pandas, how can I extract certain value using the key off of a dataframe imported from a csv file?

Using Pandas, I'm trying to extract value using the key but I keep failing to do so. Could you help me with this?
There's a csv file like below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file in Pandas and made a DataFrame out of it:
dataframe from a csv file
However, when I tried to extract the value using a key (e.g. df["id"]), I'm facing an error message.
I'd like to see a value 1234 or 5678 using df["id"]. Which step should I take to get it done? This may be a very basic question but I need your help. Thanks.
The csv file isn't being read in correctly.
You haven't set a delimiter; pandas can automatically detect a delimiter but hasn't done so in your case. See the read_csv documentation for more on this. Because the , the pandas dataframe has a single column, value, which has entire lines from your file as individual cells - the first entry is "{""id"":""1234"",""currency"":""USD""}". So, the file doesn't have a column id, and you can't select data by id.
The data aren't formatted as a pandas df, with row titles and columns of data. One option is to read in this data is to manually process each row, though there may be slicker options.
file = 'test.dat'
f = open(file,'r')
id_vals = []
currency = []
for line in f.readlines()[1:]:
## remove obfuscating characters
for c in '"{}\n':
line = line.replace(c,'')
line = line.split(',')
## extract values to two lists
id_vals.append(line[0][3:])
currency.append(line[1][9:])
You just need to clean up the CSV file a little and you are good. Here is every step:
# open your csv and read as a text string
with open('My_CSV.csv', 'r') as f:
my_csv_text = f.read()
# remove problematic strings
find_str = ['{', '}', '"', 'id:', 'currency:','value']
replace_str = ''
for i in find_str:
my_csv_text = re.sub(i, replace_str, my_csv_text)
# Create new csv file and save cleaned text
new_csv_path = './my_new_csv.csv' # or whatever path and name you want
with open(new_csv_path, 'w') as f:
f.write(my_csv_text)
# Create pandas dataframe
df = pd.read_csv('my_new_csv.csv', sep=',', names=['ID', 'Currency'])
print(df)
Output df:
ID Currency
0 1234 USD
1 5678 EUR
You need to extract each row of your dataframe using json.loads() or eval()
something like this:
import json
for row in df.iteritems():
print(json.loads(row.value)["id"])
# OR
print(eval(row.value)["id"])

Filtering rows on a csv with values that I have on a txt file

So I have a csv file where I need to filter the rows based on the values that I have on a txt file. Is there an easy way to do this on pandas? The csv will have about 2000 rows and the txt file has about 400 data points. I need to generate a csv with rows that match the data on the txt file.
The CSV file looks like this:
Chromosome Gene Start End
1 PERM1 5 6
2 AGRN 7 10
3 MIB2 9 12
The Text file looks like
PERM1
NADK
GNB1
Thank you
First read Text file into a list or tuple:
lines = tuple(open(filename, 'r'))
Then filter the lines exist in the text file:
df = read_csv('csvfile')
result = df[df['Chromosome Gene'].isin(lines)]
easy enough using pandas read and filter functionality. I'm assuming you have an input .csv file called input_csv_file and a filter file called filter.csv. Input file has a column "filter_locatitons" and input_file has a column called "locations":
input_df = pd.read_csv('input_csv_file.csv')
filter_df = pd.read_csv('filter.csv')
filtered_df = input_df[[input_df['location'].isin(filter_df['filter_locations']]
This can be achieved using mask and loading both the files in dataframes. Below code assumes your test file does not have a header and your csv file is space delimited
import pandas as pd
df1 = pd.read_csv('csvfile.csv', delimiter=' ')
df2 = pd.read_csv('textfile.txt', header=None)
df2.columns = ['Gene']
m = df1.Gene.isin(df2.Gene)
df3 = df1[m]
print(df3)

Python script that efficiently drops columns from a CSV file

I have a csv file, where the columns are separated by tab delimiter but the number of columns is not constant. I need to read the file up to the 5th column. (I dont want to ready the whole file and then extract the columns, I would like to read for example line by line and skip the remaining columns)
You can use usecols argument in pd.read_csv to limit the number of columns to be read.
# test data
s = '''a,b,c
1,2,3'''
with open('a.txt', 'w') as f:
print(s, file=f)
df1 = pd.read_csv("a.txt", usecols=range(1))
df2 = pd.read_csv("a.txt", usecols=range(2))
print(df1)
print()
print(df2)
# output
# a
#0 1
#
# a b
#0 1 2
You can use pandas nrows to read only a certain number of csv lines:
import pandas as pd
df = pd.read_csv('out122.txt', usecols=[0,1,2,3,4])

Opening csv file using python

I have a .csv file having multiple rows and columns:
chain Resid Res K B C Tr Kw Bw Cw Tw
A 1 ASP 1 0 0.000104504 NA 0 0 0.100087974 0.573972285
A 2 GLU 2 627 0.000111832 0 0.033974309 0.004533331 0.107822844 0.441666022
Whenever I open the file using pandas or using with open, it shows that there are only column and multiple rows:
629 rows x 1 columns
Here is the code im using:
data= pd.read_csv("F:/file.csv", sep='\t')
print(data)
and the result I'm getting is this"
A,1,ASP,1,0,0.0001045041279130...
I want the output to be in a dataframe form so that I can carry out future calculations. Any help will be highly appreciated
There is separator ,, so is psosible omit parameter sep, because sep=',' is deafault separator in read_csv:
data= pd.read_csv("F:/file.csv")
you can read the csv using the following code snippet
import pandas as pd
data = pd.read_csv('F:/file.csv', sep=',')
Don't use '\t', because you don't have four consecutive spaces together (a tab between), so use the below:
data = pd.read_csv("F:/file.csv")
Or if really needed, use:
data = pd.read_csv("F:/file.csv", sep='\s{2,}', engine='python')
If your data values have spaces.

Selectin Dataframe columns name from a csv file

I have a .csv to read into a DataFrame and the names of the columns are in the same .csv file in the previos rows. Usually I drop all the 'unnecesary' rows to create the DataFrame and then hardcode the names of each dataframe
Trigger time,2017-07-31,10:45:38
CH,Signal name,Input,Range,Filter,Span
CH1, "Tin_MIX_Air",TEMP,PT,Off,2000.000000,-200.000000,degC
CH2, "Tout_Fan2b",TEMP,PT,Off,2000.000000,-200.000000,degC
CH3, "Tout_Fan2a",TEMP,PT,Off,2000.000000,-200.000000,degC
CH4, "Tout_Fan1a",TEMP,PT,Off,2000.000000,-200.000000,degC
Here you can see the rows where the columns names are in double quotes "TinMix","Tout..",etc there are exactly 16 rows with names
Logic/Pulse,Off
Data
Number,Date&Time,ms,CH1,CH2,CH3,CH4,CH5,CH7,CH8,CH9,CH10,CH11,CH12,CH13,CH14,CH15,CH16,CH20,Alarm1-10,Alarm11-20,AlarmOut
NO.,Time,ms,degC,degC,degC,degC,degC,degC,%RH,%RH,degC,degC,degC,degC,degC,Pa,Pa,A,A1234567890,A1234567890,A1234
1,2017-07-31 10:45:38,000,+25.6,+26.2,+26.1,+26.0,+26.3,+25.7,+43.70,+37.22,+25.6,+25.3,+25.1,+25.3,+25.3,+0.25,+0.15,+0.00,LLLLLLLLLL,LLLLLLLLLL,LLLL
And here the values of each variables start.
What I need to do is create a Dataframe from this .csv and place these names in the columns names. I'm new to Python and I'm not very sure how to do it
import pandas as pd
path = r'path-to-file.csv'
data=pd.DataFrame()
with open(path, 'r') as f:
for line in f:
data = pd.concat([data, pd.DataFrame([tuple(line.strip().split(','))])], ignore_index=True)
data.drop(data.index[range(0,29)],inplace=True)
x=len(data.iloc[0])
data.drop(data.columns[[0,1,2,x-1,x-2,x-3]],axis=1,inplace=True)
data.reset_index(drop=True,inplace=True)
data = data.T.reset_index(drop=True).T
data = data.apply(pd.to_numeric)
This is what I've done so far to get my dataframe with the usefull data, I'm dropping all the other columns that arent useful to me and keeping only the values. Last three lines are to reset row/column indexes and to transform the whole df to floats. What I would like is to name the columns with each of the names I showed in the first piece of coding as a I said before I'm doing this manually as:
data.columns = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p']
But I would like to get them from the .csv file since theres a possibility on changing the CH# - "Name" combination
Thank you very much for the help!
Comment: possible for it to work within the other "OPEN " loop that I have?
Assume Column Names from Row 2 up to 6, Data from Row 7 up to EOF.
For instance (untested code)
data = None
columns = []
with open (path) as fh:
for row, line in enumerate (fh, 1):
if row > 2 and row <= 6:
ch, name = line.split(',')[:2]
columns.append(name)
else:
row_data = [tuple(line.strip().split(','))]
if not data:
data = pd.DataFrame(row_data, columns=columns, ignore_index=True)
else:
data.append(row_data)
Question: ... I would like to get them from the .csv file
Start with:
with open (path) as fh:
for row, line in enumerate (fh, 1):
if row > 2:
ch, name = line.split(',')[:2]

Categories