My program writes its output to a .txt file. There are 3 different tables in this output, and I need to convert these three tables into pandas dataframes. I'm not sure what the best way to approach this is.
This is what my .txt output file looks like:
column_header standard_content (Old) standard_content (New)
214 STAFF_ORIGIN_IND_NATIVE_AMER N Y
215 STAFF_ORIGIN_IND_PACIF_ISLND N Y
128 STUDENT_INFORMATION_RELEASE N Y
211 STAFF_ORIGIN_IND_ASIAN N Y
105 STUDENT_ORIGIN_IND_NATIVE_AMER N Y
104 STUDENT_ORIGIN_IND_HISPANIC N Y
160 STUDENT_OUTSIDE_CATCHMENT N Y
346 COURSE_EXTRA_POINT_ELIGIBLE N Y
528 SUBSTITUTE_REQUIRED N Y
527 STAFF_ABSENCE_AUTHORIZED N Y
column_header data_req (Old) data_req (New)
20 SCHOOL_SIZE_GROUP N Y
241 STAFF_CONTACT N Y
346 COURSE_EXTRA_POINT_ELIGIBLE N Y
434 DISCIPLINE_FED_OFFENSE_GROUP N Y
32 SCHOOL_ATTENDANCE_TYPE N Y
142 STUDENT_COUNTRY_OF_BIRTH N Y
74 FACILITY_COUNTY_CODE N Y
64 FACILITY_PARKING_SPACES N Y
436 DISCIPLINE_DIST_OFFENSE_GROUP N Y
321 STAFF_BARGAINING_UNIT N Y
column_header element_type (Old) element_type (New)
331 DISTRICT_CODE Key Local
511 DISTRICT_CODE Key Local
445 DISTRICT_CODE Key Local
2 DISTRICT_CODE Key Local
302 STAFF_ASSIGN_FINANCIAL_CODE Key Local
493 SCHEDULE_SEQUENCE Key Local
461 INCIDENT_ID Key Local
431 INCIDENT_ID Key Local
159 STUDENT_CATCHMENT_CODE Key Local
393 DISTRICT_CODE Key Local
I tried to use this in a loop, but it creates a single dataframe and it gets messed up.
df = pd.read_fwf(io.StringIO(report))
df.to_csv('data.csv')
result_df = pd.read_csv('data.csv')
print("Final report", result_df)
Is there a way I can create a new dataframe based on a keyword, for example 'column_header'? Or is there any other way I can do this?
Do this in a few steps:
1. Slurp the entire file.
2. Split it on a delimiter (empty lines).
3. Read each part into a separate dataframe.
If we let raw_data be the content of your file, this could be done with

from io import StringIO
import pandas as pd

dfs = [pd.read_fwf(StringIO(part),
                   header=None, skiprows=1,
                   names=['id', 'header', 'old', 'new'])
       for part in raw_data.strip().split('\n\n')]
The split looks for empty lines. The read_fwf call uses several pandas TextParser options to skip the header row and explicitly name the columns (the actual column headers throw off the fixed-width parser).
The first frame will look like
id header old new
0 214 STAFF_ORIGIN_IND_NATIVE_AMER N Y
1 215 STAFF_ORIGIN_IND_PACIF_ISLND N Y
2 128 STUDENT_INFORMATION_RELEASE N Y
3 211 STAFF_ORIGIN_IND_ASIAN N Y
4 105 STUDENT_ORIGIN_IND_NATIVE_AMER N Y
5 104 STUDENT_ORIGIN_IND_HISPANIC N Y
6 160 STUDENT_OUTSIDE_CATCHMENT N Y
7 346 COURSE_EXTRA_POINT_ELIGIBLE N Y
8 528 SUBSTITUTE_REQUIRED N Y
9 527 STAFF_ABSENCE_AUTHORIZED N Y
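If your file has no blank lines between the tables, a variant of the same idea is to split on the keyword you suggested. This is a sketch assuming each table starts with a line beginning with 'column_header' (the file name data.txt is hypothetical):

import re
import pandas as pd
from io import StringIO

with open('data.txt') as f:  # hypothetical file name
    raw_data = f.read()

# split just before every 'column_header ...' line
parts = re.split(r'\n(?=column_header)', raw_data.strip())
dfs = [pd.read_fwf(StringIO(part), header=None, skiprows=1,
                   names=['id', 'header', 'old', 'new'])
       for part in parts]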
I would like to know how to remove some rows from a dataset, specifically rows whose text is a number or belongs to a list of strings. For example:
Text Num
0 bam 132
1 - 65
2 creation 47
3 MAN 32
4 41 831
... ... ...
460 Luchino 21
461 42 4126 7
462 finger 43
463 washing 1
I would like to have something like
Text Num
0 bam 132
2 creation 47
... ... ...
460 Luchino 21
462 finger 43
463 washing 1
where I removed (manually) MAN (it should be part of a list of strings, like a stop word), the '-' entries, and the numbers.
I have tried with isdigit but it is not working, so I am sure there are errors in my code:
df['Text'].where(~df['Text'].str.isdigit())
and for my stop words:
my_stop=['MAN','-']
df['Text'].apply(lambda lst: [x for x in lst if x in my_stop])
If you want to filter, you could use .loc:
df = df.loc[~df.Text.str.isdigit() & ~df.Text.isin(['MAN']), :]
.where(cond, other) returns a dataframe or series of the same shape as self, keeping the original values where cond is true and replacing them with other where it is false.
Read more in the docs
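For completeness, a minimal runnable sketch of that filter which also drops '-' through the my_stop list from the question (the toy data here is made up for illustration):

import pandas as pd

df = pd.DataFrame({'Text': ['bam', '-', 'creation', 'MAN', '41'],
                   'Num': [132, 65, 47, 32, 831]})
my_stop = ['MAN', '-']

# keep rows whose Text is neither a pure number nor a stop word
df = df.loc[~df['Text'].str.isdigit() & ~df['Text'].isin(my_stop), :]
print(df)
#        Text  Num
# 0       bam  132
# 2  creation   47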
Hi, you should try this code:
df[df['Text']!='MAN']
Given the following NumPy array describing my bin edges, created like so:
np.arange(min_value, max_value + 1, bin_size)
[ -1 35 71 107 143 179 215 251 287 323 359]
I would like to create a string labels array like so:
['0-36','36-72','72-108','108-144','144-180','180-216','216-252','252-288','288-324','324-360']
What would be the way to do it?
Use a list comprehension with f-strings:
b = [f'{i+1}-{j+1}' for i, j in zip(a[:-1], a[1:])]
print(b)
['0-36', '36-72', '72-108', '108-144', '144-180',
'180-216', '216-252', '252-288', '288-324', '324-360']
Alternatively, shift the whole array up by one first:
a += 1
b = [f'{i}-{j}' for i, j in zip(a[:-1], a[1:])]
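A self-contained run, assuming the question's edges come from min_value=-1, max_value=359, bin_size=36:

import numpy as np

a = np.arange(-1, 359 + 1, 36)  # [-1, 35, 71, ..., 359]
a = a + 1                       # shift every edge up by one
b = [f'{i}-{j}' for i, j in zip(a[:-1], a[1:])]
print(b)  # ['0-36', '36-72', ..., '324-360']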
I'm not even sure if the title makes sense.
I have a pandas dataframe with 3 columns: x, y, time. There are a few thousand rows. Example below:
x y time
0 225 0 20.295270
1 225 1 21.134015
2 225 2 21.382298
3 225 3 20.704367
4 225 4 20.152735
5 225 5 19.213522
.......
900 437 900 27.748966
901 437 901 20.898460
902 437 902 23.347935
903 437 903 22.011992
904 437 904 21.231041
905 437 905 28.769945
906 437 906 21.662975
.... and so on
What I want to do is retrieve the rows that have the smallest time for each y. Basically, for every element of y, I want to find the row with the smallest time value, but I want to exclude rows whose time is 0.0 (this happens when x has the same value as y).
So, for example, the fastest way to get to y-0 is by starting from x-225, and so on; the same x may therefore repeat, but for a different y.
e.g.
x y time
225 0 20.295270
438 1 19.648954
27 20 4.342732
9 438 17.884423
225 907 24.560400
So far I have tried groupby, but I'm only getting the same x as y.
print(df.groupby('y', sort=False)['time'].idxmin())
y
0 0
1 1
2 2
3 3
4 4
The one below just returns the df that I already have.
df.loc[df.groupby("y")["time"].idxmin()]
Just to point out one thing: I'm open to options, not just groupby; if there are other ways, that is very good too.
So you need to remove the rows with time equal to 0 first, by boolean indexing, and then use your solution:
df = df[df['time'] != 0]
df2 = df.loc[df.groupby("y")["time"].idxmin()]
A similar alternative, filtering with query:
df = df.query('time != 0')
df2 = df.loc[df.groupby("y")["time"].idxmin()]
Or use sort_values with drop_duplicates:
df2 = df[df['time'] != 0].sort_values(['y','time']).drop_duplicates('y')
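A quick check on toy data (the times here are invented) showing that the filter-then-deduplicate approach returns one row per y and skips the time == 0 self-pairs:

import pandas as pd

df = pd.DataFrame({'x':    [0,   225,  438,  1,   225,  438],
                   'y':    [0,   0,    0,    1,   1,    1],
                   'time': [0.0, 20.3, 25.1, 0.0, 21.1, 19.6]})

df2 = df[df['time'] != 0].sort_values(['y', 'time']).drop_duplicates('y')
print(df2)
#      x  y  time
# 1  225  0  20.3
# 5  438  1  19.6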
My Excel data looks like this:
A B C
1 123 534 576
2 456 745 345
3 234 765 285
In another Excel spreadsheet, my data may look like this:
B C A
1 123 534 576
2 456 745 345
3 234 765 285
How can I extract column C's contents from both spreadsheets?
My code is as follows:
#Open the workbook
ow = xlrd.open_workbook('export.xlsx').sheet_by_index(0)
#Store column 3's data inside an array
ips = ow.col_values(2, 1)
I would like something more like: ips = ow.col_values(C, 1)
How can I achieve the above?
Since I have two different spreadsheets, and the data I want is in a different column in each, I have to search the first row by name until I find it, then extract that column.
Here's how I did it:
import xlrd

ow = xlrd.open_workbook('export.xlsx').sheet_by_index(0)
for x in range(0, 20):
    try:
        if ow.cell_value(0, x) == "IP Address":
            print("found it!")
            ips = ow.col_values(x, 1)
            break
    except IndexError:
        continue
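A tidier variant of the same idea, assuming the headers really are in row 0, looks the name up once with list.index instead of scanning cell by cell:

import xlrd

ow = xlrd.open_workbook('export.xlsx').sheet_by_index(0)
headers = ow.row_values(0)         # every cell in the first row
col = headers.index("IP Address")  # raises ValueError if the header is missing
ips = ow.col_values(col, 1)        # the column's data, skipping the header row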
I wrote a piece of code that finds common IDs in line[1] of two different files. My input file is huge (2 million lines). If I split it into many small files, it finds more intersecting IDs; if I feed it the whole file at once, far fewer. I cannot figure out why. Can you suggest what is wrong, and how to improve this code to avoid the problem?
fileA = open("file1.txt", 'r')
fileB = open("file2.txt", 'r')
output = open("result.txt", 'w')

dictA = dict()
for line1 in fileA:
    listA = line1.split('\t')
    dictA[listA[1]] = listA

dictB = dict()
for line1 in fileB:
    listB = line1.split('\t')
    dictB[listB[1]] = listB

for key in dictB:
    if key in dictA:
        output.write(dictA[key][0] + '\t' + dictA[key][1] + '\t' + dictB[key][4] + '\t' +
                     dictB[key][5] + '\t' + dictB[key][9] + '\t' + dictB[key][10])
My file1 is sorted by line[0] and has fields 0-15:
contig17 GRMZM2G052619_P03 98 109 2 0 15 67 78.8 0 127 5 420 0 304 45
contig33 AT2G41790.1 98 420 2 0 21 23 78.8 1 127 5 420 2 607 67
contig98 GRMZM5G888620_P01 87 470 1 0 17 28 78.8 1 127 7 420 2 522 18
contig102 GRMZM5G886789_P02 73 115 1 0 34 45 78.8 0 134 5 421 0 456 50
contig123 AT3G57470.1 83 201 2 1 12 43 78.8 0 134 9 420 0 305 50
My file2 is not sorted and has fields 0-10:
GRMZM2G052619 GRMZM2G052619_P03 4 2345 GO:0043531 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07525 1
GRMZM5G888620 GRMZM5G888620_P01 1 2367 GO:0011551 DNA binding "Any molecular function by which a gene product interacts selectively and non-covalently with DNA" [GOC:jl] molecular_function PF07589 4
GRMZM5G886789 GRMZM5G886789_P02 1 4567 GO:0055516 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07526 0
My desired output:
contig17 GRMZM2G052619_P03 GO:0043531 ADP binding molecular_function PF07525
contig98 GRMZM5G888620_P01 GO:0011551 DNA binding molecular_function PF07589
contig102 GRMZM5G886789_P02 GO:0055516 ADP binding molecular_function PF07526
I really recommend you use pandas to cope with this kind of problem.
As proof that this can be done simply with pandas:
import pandas as pd  # install this, and read the docs
from io import StringIO  # you don't need this; it is only used to simulate a file

# simulating reading the first file
first_file = """contig17 GRMZM2G052619_P03 x
contig33 AT2G41790.1 x
contig98 GRMZM5G888620_P01 x
contig102 GRMZM5G886789_P02 x
contig123 AT3G57470.1 x"""

# simulating reading the second file
second_file = """y GRMZM2G052619_P03 y
y GRMZM5G888620_P01 y
y GRMZM5G886789_P02 y"""

# here is how you open the files. Instead of StringIO
# you will simply pass the file path. Give the correct separator,
# sep="\t" for tabular data. Here I'm using a space.
# In names, put some relevant names for your columns.
f_df = pd.read_table(StringIO(first_file),
                     header=None,
                     sep=" ",
                     names=['a', 'b', 'c'])
s_df = pd.read_table(StringIO(second_file),
                     header=None,
                     sep=" ",
                     names=['d', 'e', 'f'])

# this is the hard bit. It selects the rows of the second data frame
# whose value in column e "isin" column b of the first data frame.
my_df = s_df[s_df.e.isin(f_df.b)]
Output:
d e f
0 y GRMZM2G052619_P03 y
1 y GRMZM5G888620_P01 y
2 y GRMZM5G886789_P02 y
# you can save this with:
my_df.to_csv("result.txt", sep="\t")
Cheers!
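P.S. Since your desired output mixes columns from both files, a merge on the shared ID column may be closer to what you want; a sketch reusing the f_df and s_df frames from above:

# join the two frames on the shared ID columns (b in f_df, e in s_df)
merged = f_df.merge(s_df, left_on='b', right_on='e')
merged.to_csv("result.txt", sep="\t", index=False)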
This is almost the same but within a function.
# Creates a function to do the reading for each file
def read_store(file_, dictio_):
    """Given a file name and a dictionary, stores each line of the
    file in the dictionary, keyed by the value in its second column."""
    import re
    with open(file_, 'r') as file_0:
        for line in file_0:
            # non-greedy match: capture the word (letters, numbers or
            # underscore) that follows the first whitespace separator
            match = re.search(r"^.+?\s+(\w+)", line)
            if match:
                dictio_[match.group(1)] = line
To use it, do:
file1 = {}
read_store("file1.txt", file1)
And then compare them as you normally do, but I would use \s instead of \t to split. Even though it will also split between words, that is easy to rejoin with " ".join(DictA[1:5]).
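A minimal sketch of the comparison step that writes the full matching lines side by side (picking individual columns is left out here, since \s-splitting shifts the field positions, as noted above):

file1 = {}
file2 = {}
read_store("file1.txt", file1)
read_store("file2.txt", file2)

with open("result.txt", 'w') as out:
    for key in file2:
        if key in file1:
            # the stored values are whole lines; join them with a tab
            out.write(file1[key].rstrip('\n') + '\t' + file2[key])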