Pandas parsing excel file all in column A - python

I have a wireless radio readout that basically dumps all of the data into one column (column 'A') of a spreadsheet (.xlsx). Is there any way to parse the twenty-plus columns into a pandas dataframe? This is an example of the data in column A of the Excel file:
DSP ALLMSINFO:SECTORID=0,CARRIERID=0;
Belgium351G
+++ HUAWEI 2020-04-03 10:04:47 DST
O&M #4421590
%%/*35687*/DSP ALLMSINFO:SECTORID=0,CARRIERID=0;%%
RETCODE = 0 Operation succeeded
Display Information of All MSs-
------------------------------
Sector ID Carrier ID MSID MSSTATUS MSPWR(dBm) DLCINR(dB) ULCINR(dB) DLRSSI(dBm) ULRSSI(dBm) DLFEC ULFEC DLREPETITIONFATCTOR ULREPETITIONFATCTOR DLMIMOFLAG BENUM NRTPSNUM RTPSNUM ERTPSNUM UGSNUM UL PER for an MS(0.001) NI Value of the Band Where an MS Is Located(dBm) DL Traffic Rate for an MS(byte/s) UL Traffic Rate for an MS(byte/s)
0 0 0011-4D10-FFBA Enter -2 29 27 -56 -107 21 20 0 0 MIMO B 2 0 0 0 0 0 -134 158000 46000
0 0 501F-F63B-FB3B Enter 13 27 28 -68 -107 21 20 0 0 MIMO A 2 0 0 0 0 0 -134 12 8
Basically I just want to parse this data and have the table in a dataframe. Any help would be greatly appreciated.

You could try pandas' read_excel:
df = pd.read_excel(filename, skiprows=9)
This assumes we want to ignore the first 9 rows that don't make up the dataframe! Docs here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

Load the excel file and split the column on the spaces.
A problem may occur with "DLMIMOFLAG" because it has a space in the data and this will cause it to be split over two columns. It's optional whether this is acceptable or if the columns are merged back together afterwards.
Add the header manually rather than load it, otherwise all the spaces in the header will confuse the loading & splitting routines.
import pandas as pd
# Start on the first data row - row 10.
# header=None makes sure pandas knows that only data is being loaded.
df = pd.read_excel('radio.xlsx', skiprows=10, header=None)
This gives a dataframe that is only data, all held in one column.
To split these out, reference the first column with df.iloc[:,0], split it on whitespace with str.split(), and convert the resulting Series of lists into a plain list with values.tolist() so pandas can build a new dataframe from it.
Together this looks like:
df2 = pd.DataFrame(df.iloc[:,0].str.split().values.tolist())
Note the example given has an extra column because the space in "DLMIMOFLAG" causes it to be split over two columns. These will be referred to as "DLMIMOFLAG_A" and "DLMIMOFLAG_B".
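As an aside, newer pandas versions can do the split and the expansion into columns in one step with expand=True, avoiding the values.tolist() detour; a minimal sketch on two made-up rows mirroring the single-column layout:

```python
import pandas as pd

# Two made-up rows mirroring the single-column layout of the loaded file
df = pd.DataFrame({0: ["0 0 0011-4D10-FFBA Enter -2 29 27",
                       "0 0 501F-F63B-FB3B Enter 13 27 28"]})

# expand=True returns a DataFrame directly: one column per whitespace-separated field
df2 = df.iloc[:, 0].str.split(expand=True)
print(df2.shape)  # (2, 7)
```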
Now add on the column headers.
Optionally create a list first.
column_names = ["Sector ID", "Carrier ID", "MSID", "MSSTATUS", "MSPWR(dBm)", "DLCINR(dB)", "ULCINR(dB)",
                "DLRSSI(dBm)", "ULRSSI(dBm)", "DLFEC", "ULFEC", "DLREPETITIONFATCTOR", "ULREPETITIONFATCTOR",
                "DLMIMOFLAG_A", "DLMIMOFLAG_B", "BENUM", "NRTPSNUM", "RTPSNUM", "ERTPSNUM", "UGSNUM",
                "UL PER for an MS(0.001)", "NI Value of the Band Where an MS Is Located(dBm)",
                "DL Traffic Rate for an MS(byte/s)", "UL Traffic Rate for an MS(byte/s)"]
df2.columns = column_names
This gives the output as a full dataframe with column headers.
Sector ID Carrier ID MSID MSSTATUS
0 0 0011-4D10-FFBA Enter
0 0 501F-F63B-FB3B Enter
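If the merged flag is preferred, the two halves can be recombined after the headers are set; a minimal sketch, assuming the DLMIMOFLAG_A/DLMIMOFLAG_B names used above:

```python
import pandas as pd

# Hypothetical frame holding just the two halves of the split flag
df2 = pd.DataFrame({"DLMIMOFLAG_A": ["MIMO", "MIMO"],
                    "DLMIMOFLAG_B": ["B", "A"]})

# Rejoin the halves with a space, then drop the leftover columns
df2["DLMIMOFLAG"] = df2["DLMIMOFLAG_A"] + " " + df2["DLMIMOFLAG_B"]
df2 = df2.drop(columns=["DLMIMOFLAG_A", "DLMIMOFLAG_B"])
print(df2["DLMIMOFLAG"].tolist())  # ['MIMO B', 'MIMO A']
```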

Related

Read file using Python/Pandas

I have a tab-delimited file with data like:
id Name address dept sal
1 abc "bangalore,
Karnataka,
Inida" 10 500
2 xyz "Hyderabad
Inida" 20 500
Here the columns are id,Name,address,dept, and sal.
The issue is with address columns that can contain a new line character. I tried different methods to read the file using Pandas and Python but instead of two rows, I am getting multiple rows as output.
Here are the few commands I tried:
file1 = open('C:/dummy/dummy.csv', 'r')
lines = file1.readlines()
for i in lines:
    print(i)
and
df = pd.read_csv("C:/dummy/dummy.csv",sep='\t',quotechar='"')
Can anyone please help?
df = pd.read_csv("C:/dummy/dummy.csv",sep='\t',quotechar='"')
Assuming the columns really are tab-delimited in the csv file, as you say, the corresponding output is:
id Name address dept sal
0 1 abc bangalore,\r\nKarnataka,\r\nInida 10 500
1 2 xyz Hyderabad\r\nInida 20 500
If you would like to remove the CR-LF sequences within the strings, you can strip them in a post-processing step.
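For example, the embedded line breaks can be replaced with spaces using str.replace; a small sketch on a frame mirroring the output above:

```python
import pandas as pd

df = pd.DataFrame({"address": ["bangalore,\r\nKarnataka,\r\nInida",
                               "Hyderabad\r\nInida"]})

# regex=False treats '\r\n' as a literal substring, not a pattern
df["address"] = df["address"].str.replace("\r\n", " ", regex=False)
print(df["address"].tolist())
```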
Additionally, you could define the index column via
df = pd.read_csv("C:/dummy/dummy.csv",sep='\t',quotechar='"',index_col=0)
What is your desired/expected output?

Writing pandas column to csv without merging integers

I have extracted user_id against shop_ids as pandas dataframe from database using SQL query.
user_id shop_ids
0 022221205 541
1 023093087 5088,4460,4460,4460,4460,4460,4460,4460,5090
2 023096023 2053,2053,2053,2053,2053,2053,2053,2053,2053,1...
3 023096446 4339,4339,3966,4339,4339
4 023098684 5004,3604,5004,5749,5004
I am trying to write this dataframe into csv using:
df.to_csv('users_ordered_shops.csv')
I end up with the csv merging the shop ids into one number as such:
user_id shop_ids
0 22221205 541
1 23093087 508,844,604,460,446,000,000,000,000,000,000,000
2 23096023 2,053,205,320,532,050,000,000,000,000,000,000,000,000,000,000,000,000
3 23096446 43,394,339,396,643,300,000
4 23098684 50,043,604,500,457,400,000
The values for index 2 are:
print(df.iloc[2].shop_ids)
2053,2053,2053,2053,2053,2053,2053,2053,2053,1294,1294,2053,1922
Expected output is a csv file with all shop_ids intact in one column or different columns like:
user_id shop_ids
0 022221205 541
1 023093087 5088,4460,4460,4460,4460,4460,4460,4460,5090
2 023096023 2053,2053,2053,2053,2053,2053,2053,2053,2053,1294,1294,2053,1922
3 023096446 4339,4339,3966,4339,4339
4 023098684 5004,3604,5004,5749,5004
Any tips on how to get the shop ids without merging when writing to a csv file? I have tried converting the shop_ids column using astype() to int and str which has resulted in the same output.
Update
To get one shop per column (and remove duplicates), you can use:
pd.concat([df['user_id'],
           df['shop_ids'].apply(lambda x: sorted(set(x.split(','))))
                         .apply(pd.Series)],
          axis=1).to_csv('users_ordered_shops.csv', index=False)
Change the delimiter. Try:
df.to_csv('users_ordered_shops.csv', sep=';')
Or change the quoting strategy:
import csv
df.to_csv('users_ordered_shops.csv', quoting=csv.QUOTE_NONNUMERIC)
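With csv.QUOTE_NONNUMERIC every non-numeric field is wrapped in double quotes, so the commas inside shop_ids stay within a single CSV field when the file is read back; a small sketch writing to an in-memory buffer instead of a file:

```python
import csv
import io

import pandas as pd

df = pd.DataFrame({"user_id": ["023093087"],
                   "shop_ids": ["5088,4460,5090"]})

buf = io.StringIO()
df.to_csv(buf, quoting=csv.QUOTE_NONNUMERIC, index=False)
print(buf.getvalue())  # both string columns come out quoted
```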

How do I read from an Excel spreadsheet only rows meeting a certain condition into Python?

I need to create a pandas dataframe in Python by reading in an Excel spreadsheet that contains almost 50,000 rows and 81 columns. The file contains information about medical professionals of all kinds: physicians, nurses, nurse practitioners, etc. I want to read in only rows where a column 'PROFTYPE' has value of 'NURSEPRACT'.
I'm using Python 3.7.3. I've read in the entire file and then trimmed it down by the PROFTYPE column afterward, but the reading in takes too long. I'd like to read in only those rows where PROFTYPE == 'NURSEPRACT'.
df_np = pd.read_excel(SourceFile, sheet_name='Data', header=0)
df_np = df_np[df_np['PROFTYPE'] == 'NURSEPRACT']
This code actually works, but that's because I'm reading in the entire file first. I'm actually interested in reading in only those that meet the condition of PROFTYPE = 'NURSEPRACT'.
One idea is that you can
load only the 'PROFTYPE' column,
identify the non-nurse practitioner rows,
load the entire table to keep only the nurse practitioner rows.
Here that strategy is in action:
df = pd.read_excel(SourceFile,
                   sheet_name='Data',
                   header=0,
                   usecols=['PROFTYPE'])  # <-- Load just 'PROFTYPE' of the following table
# ID PROFTYPE YEARS_IN_PRACTICE
# 1234 NURSEPRACT 12
# 43 NURSE 32
# 789 NURSEPRACT 4
# 34 PHYSICIAN 2
# 93 NURSEPRACT 13
row_numbers = [x+1 for x in df[df['PROFTYPE'] != 'NURSEPRACT'].index]
df = pd.read_excel(SourceFile, sheet_name='Data', header=0, skiprows=row_numbers)
# ID PROFTYPE YEARS_IN_PRACTICE
# 1234 NURSEPRACT 12
# 789 NURSEPRACT 4
# 93 NURSEPRACT 13

Finding the row number for the header row in a CSV file / Pandas Dataframe

I am trying to get an index or row number for the row that holds the headers in my CSV file.
The issue is, the header row can move up and down depending on the output of the report from our system (I have no control to change this)
code:
ht = pd.read_csv('file.csv')
test = ht.get_loc('Code')  # 'Code' being the header I'm using to locate the header row
csv1 = pd.read_csv('file.csv', header=test)
df1 = df1.append(csv1)  # Appending as I have many files
If I were to print test, I would expect a number around 4 or 5, and that's what I am feeding into the second read_csv call.
The error I'm getting says it expects 1 header column, but I have 26 columns. I am just trying to use the first header string to get the row number.
Thanks
:-)
Edit:
CSV format
This file contains the data around the volume of items blablalbla
the deadlines for delivery of items a - z is 5 days
the deadlines for delivery of items aa through zz are 3 days
the deadlines for delivery of items aaa through zzz are 1 days
code,type,arrived_date,est_del_date
a/wrwgwr12/001,kids,12-dec-18,17-dec-18
aa/gjghgj35/030,pet,15-dec-18,18-dec-18
As you will see, the "the deadlines" rows are similar; there can be 3 or 5 of them based on the code IDs, so the header row can move up or down.
I also did not write out all 26 column headers; I'm not sure that matters.
Wanted DF format
index | code | type | arrived_date | est_del_date
1 | a/wrwgwr12/001 | kids | 12-dec-18 | 17-dec-18
2 | aa/gjghgj35/030 | Pet | 15-dec-18 | 18-dec-18
Hope this makes sense..
Thanks,
You can use the csv module to find the first row which contains a delimiter, then feed the index of this row as the skiprows parameter to pd.read_csv:
from io import StringIO
import csv
import pandas as pd
x = """This file contains the data around the volume of items blablalbla
the deadlines for delivery of items a - z is 5 days
the deadlines for delivery of items aa through zz are 3 days
the deadlines for delivery of items aaa through zzz are 1 days
code,type,arrived_date,est_del_date
a/wrwgwr12/001,kids,12-dec-18,17-dec-18
aa/gjghgj35/030,pet,15-dec-18,18-dec-18"""
# replace StringIO(x) with open('file.csv', 'r')
with StringIO(x) as fin:
    reader = csv.reader(fin)
    idx = next(idx for idx, row in enumerate(reader) if len(row) > 1)  # 4
# replace StringIO(x) with 'file.csv'
df = pd.read_csv(StringIO(x), skiprows=idx)
print(df)
code type arrived_date est_del_date
0 a/wrwgwr12/001 kids 12-dec-18 17-dec-18
1 aa/gjghgj35/030 pet 15-dec-18 18-dec-18

Python extract columns with repetitive headers with pandas

I have a csv file with 900000 rows and 30 columns. The header is in the first row:
"Probe Set ID","dbSNP RS ID","Chromosome","Physical Position", etc...
I want to extract only certain columns using pandas.
Now my problem is that the header repeats itself every 50 rows or so, so when I extract the columns I get only the first 50 rows. How can I get the complete columns while skipping all the repeated headers but the first one?
This is the code I have so far, but works nicely only until the second header:
import pandas
data = pandas.read_csv('data1.csv', usecols=['dbSNP RS ID', 'Physical Position'])
import sys
sys.stdout = open("data2.csv", "w")
print(data)
This is an example representing some rows of the extracted columns:
dbSNP RS ID Physical Position
0 rs4147951 66943738
1 rs2022235 14326088
2 rs6425720 31709555
3 rs12997193 106584554
4 rs9933410 82323721
...
48 rs5771794 49157118
49 rs1061497 1415331
50 rs12647065 136012580
dbSNP RS ID Physical Position
...
dbSNP RS ID Physical Position
...
and so on...
Thanks very much in advance!
You could read the file with header=None, drop the duplicate rows (which keeps the first per default), and then set the remaining first row as header like so:
df = pd.read_csv(path, header=None).drop_duplicates()
df.columns = df.iloc[0]
df = df.iloc[1:]
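On a small inline sample with one repeated header row, the approach looks like this (StringIO stands in for the real file; note that drop_duplicates would also remove genuinely duplicated data rows):

```python
from io import StringIO

import pandas as pd

data = """dbSNP RS ID,Physical Position
rs4147951,66943738
rs2022235,14326088
dbSNP RS ID,Physical Position
rs6425720,31709555"""

df = pd.read_csv(StringIO(data), header=None).drop_duplicates()
df.columns = df.iloc[0]  # promote the surviving header row to column labels
df = df.iloc[1:]         # and drop it from the data
print(df)
```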
