I have a tab-delimited file with data like:
id Name address dept sal
1 abc "bangalore,
Karnataka,
Inida" 10 500
2 xyz "Hyderabad
Inida" 20 500
Here the columns are id, Name, address, dept, and sal.
The issue is with the address column, which can contain newline characters. I tried different methods to read the file using pandas and Python, but instead of two rows I get multiple rows as output.
Here are the few commands I tried:
file1 = open('C:/dummy/dummy.csv', 'r')
lines = file1.readlines()
for i in lines:
    print(i)
and
df = pd.read_csv("C:/dummy/dummy.csv",sep='\t',quotechar='"')
Can anyone please help?
df = pd.read_csv("C:/dummy/dummy.csv",sep='\t',quotechar='"')
If the columns are tab-delimited in the CSV file, as you say, the corresponding output is:
id Name address dept sal
0 1 abc bangalore,\r\nKarnataka,\r\nInida 10 500
1 2 xyz Hyderabad\r\nInida 20 500
If you'd like to remove the CR-LF characters within the strings, you can do so via post-processing.
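For instance, a minimal post-processing sketch (assuming the sample data above and the column name address):

```python
import pandas as pd
from io import StringIO

# Inline stand-in for C:/dummy/dummy.csv (one row shown)
data = 'id\tName\taddress\tdept\tsal\n1\tabc\t"bangalore,\nKarnataka,\nInida"\t10\t500\n'
df = pd.read_csv(StringIO(data), sep='\t', quotechar='"')

# Replace embedded line breaks inside the address field with a space
df['address'] = (df['address']
                 .str.replace('\r\n', ' ', regex=False)
                 .str.replace('\n', ' ', regex=False))
print(df.loc[0, 'address'])  # bangalore, Karnataka, Inida
```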
Additionally, you could define the index column via
df = pd.read_csv("C:/dummy/dummy.csv",sep='\t',quotechar='"',index_col=0)
What is your desired/expected output?
I have extracted user_id against shop_ids as pandas dataframe from database using SQL query.
user_id shop_ids
0 022221205 541
1 023093087 5088,4460,4460,4460,4460,4460,4460,4460,5090
2 023096023 2053,2053,2053,2053,2053,2053,2053,2053,2053,1...
3 023096446 4339,4339,3966,4339,4339
4 023098684 5004,3604,5004,5749,5004
I am trying to write this dataframe into csv using:
df.to_csv('users_ordered_shops.csv')
I end up with the csv merging the shop ids into one number as such:
user_id shop_ids
0 22221205 541
1 23093087 508,844,604,460,446,000,000,000,000,000,000,000
2 23096023 2,053,205,320,532,050,000,000,000,000,000,000,000,000,000,000,000,000
3 23096446 43,394,339,396,643,300,000
4 23098684 50,043,604,500,457,400,000
The values for index 2 are:
print(df.iloc[2].shop_ids)
2053,2053,2053,2053,2053,2053,2053,2053,2053,1294,1294,2053,1922
Expected output is a csv file with all shop_ids intact in one column or different columns like:
user_id shop_ids
0 022221205 541
1 023093087 5088,4460,4460,4460,4460,4460,4460,4460,5090
2 023096023 2053,2053,2053,2053,2053,2053,2053,2053,2053,1294,1294,2053,1922
3 023096446 4339,4339,3966,4339,4339
4 023098684 5004,3604,5004,5749,5004
Any tips on how to get the shop IDs without merging when writing to a CSV file? I have tried converting the shop_ids column using astype() to both int and str, which resulted in the same output.
Update
To get one shop per column (and remove duplicates), you can use:
pd.concat([df['user_id'],
           df['shop_ids'].apply(lambda x: sorted(set(x.split(','))))
                         .apply(pd.Series)],
          axis=1).to_csv('users_ordered_shops.csv', index=False)
Change the delimiter. Try:
df.to_csv('users_ordered_shops.csv', sep=';')
Or change the quoting strategy:
import csv
df.to_csv('users_ordered_shops.csv', quoting=csv.QUOTE_NONNUMERIC)
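It may also be worth verifying what actually lands in the file: the written CSV often contains the IDs intact, and the "merging" is just a spreadsheet program reformatting the display as thousands-separated numbers. A small round-trip check with made-up data standing in for the question's frame:

```python
import csv
import pandas as pd

# Hypothetical frame mirroring the question's data
df = pd.DataFrame({'user_id': ['022221205', '023093087'],
                   'shop_ids': ['541', '5088,4460,5090']})
df.to_csv('users_ordered_shops.csv', index=False, quoting=csv.QUOTE_NONNUMERIC)

# Read it back as strings to confirm nothing was merged or truncated
check = pd.read_csv('users_ordered_shops.csv', dtype=str)
print(check.loc[1, 'shop_ids'])  # 5088,4460,5090
```

Reading back with dtype=str also preserves the leading zeros in user_id.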
I need to create a pandas dataframe in Python by reading in an Excel spreadsheet that contains almost 50,000 rows and 81 columns. The file contains information about medical professionals of all kinds: physicians, nurses, nurse practitioners, etc. I want to read in only rows where a column 'PROFTYPE' has value of 'NURSEPRACT'.
I'm using Python 3.7.3. I've read in the entire file and then trimmed it down by the column PROFTYPE afterward, but reading it in takes too long. I'd like to read in only those rows where PROFTYPE == 'NURSEPRACT'.
df_np = pd.read_excel(SourceFile, sheet_name='Data', header=0)
df_np = df_np[df_np['PROFTYPE'] == 'NURSEPRACT']
This code actually works, but that's because I'm reading in the entire file first. I'm actually interested in reading in only those that meet the condition of PROFTYPE = 'NURSEPRACT'.
One idea is that you can:
load only the 'PROFTYPE' column,
identify the non-nurse-practitioner rows, and
load the entire table while skipping those rows, so only the nurse practitioner rows remain.
Here that strategy is in action:
df = pd.read_excel(SourceFile,
                   sheet_name='Data',
                   header=0,
                   usecols=['PROFTYPE'])  # <-- Load just 'PROFTYPE' of the following table
# ID PROFTYPE YEARS_IN_PRACTICE
# 1234 NURSEPRACT 12
# 43 NURSE 32
# 789 NURSEPRACT 4
# 34 PHYSICIAN 2
# 93 NURSEPRACT 13
row_numbers = [x+1 for x in df[df['PROFTYPE'] != 'NURSEPRACT'].index]
df = pd.read_excel(SourceFile, sheet_name='Data', header=0, skiprows=row_numbers)
# ID PROFTYPE YEARS_IN_PRACTICE
# 1234 NURSEPRACT 12
# 789 NURSEPRACT 4
# 93 NURSEPRACT 13
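The off-by-one in skiprows can be checked in isolation: with header=0, frame index i corresponds to spreadsheet row i + 1, so skipping index + 1 removes exactly the non-matching rows. A sketch with made-up data standing in for the first-pass read:

```python
import pandas as pd

# Stand-in for the first pass: just the 'PROFTYPE' column of the example table
df = pd.DataFrame({'PROFTYPE': ['NURSEPRACT', 'NURSE', 'NURSEPRACT',
                                'PHYSICIAN', 'NURSEPRACT']})

# Rows to skip in the second pass (shift by 1 to account for the header row)
row_numbers = [x + 1 for x in df[df['PROFTYPE'] != 'NURSEPRACT'].index]
print(row_numbers)  # [2, 4]
```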
I am trying to get an index or row number for the row that holds the headers in my CSV file.
The issue is, the header row can move up and down depending on the output of the report from our system (I have no control to change this)
code:
ht = pd.read_csv('file.csv')
test = ht.get_loc('Code')  # 'Code' being the header I'm using to locate the header row
csv1 = pd.read_csv('file.csv', header=test)
df1 = df1.append(csv1)  # Appending as I have many files
If I were to print test, I would expect a number around 4 or 5, and that's what I am feeding into the second read_csv.
The error I'm getting is that it's expecting 1 header column, but I have 26 columns. I am just trying to use the first header string to get the row number.
Thanks
:-)
Edit:
CSV format
This file contains the data around the volume of items blablalbla
the deadlines for delivery of items a - z is 5 days
the deadlines for delivery of items aa through zz are 3 days
the deadlines for delivery of items aaa through zzz are 1 days
code,type,arrived_date,est_del_date
a/wrwgwr12/001,kids,12-dec-18,17-dec-18
aa/gjghgj35/030,pet,15-dec-18,18-dec-18
As you will see, the "the deadlines" rows are similar, and there can be a varying number of them depending on the code IDs, so the header row can move up or down.
I also did not write out all 26 column headers; I'm not sure that matters.
Wanted DF format
index | code | type | arrived_date | est_del_date
1 | a/wrwgwr12/001 | kids | 12-dec-18 | 17-dec-18
2 | aa/gjghgj35/030 | Pet | 15-dec-18 | 18-dec-18
Hope this makes sense..
Thanks,
You can use the csv module to find the first row which contains a delimiter, then feed the index of this row as the skiprows parameter to pd.read_csv:
from io import StringIO
import csv
import pandas as pd
x = """This file contains the data around the volume of items blablalbla
the deadlines for delivery of items a - z is 5 days
the deadlines for delivery of items aa through zz are 3 days
the deadlines for delivery of items aaa through zzz are 1 days
code,type,arrived_date,est_del_date
a/wrwgwr12/001,kids,12-dec-18,17-dec-18
aa/gjghgj35/030,pet,15-dec-18,18-dec-18"""
# replace StringIO(x) with open('file.csv', 'r')
with StringIO(x) as fin:
    reader = csv.reader(fin)
    idx = next(idx for idx, row in enumerate(reader) if len(row) > 1)  # 4
# replace StringIO(x) with 'file.csv'
df = pd.read_csv(StringIO(x), skiprows=idx)
print(df)
code type arrived_date est_del_date
0 a/wrwgwr12/001 kids 12-dec-18 17-dec-18
1 aa/gjghgj35/030 pet 15-dec-18 18-dec-18
I have a csv file with 900000 rows and 30 columns. The header is in the first row:
"Probe Set ID","dbSNP RS ID","Chromosome","Physical Position", etc...
I want to extract only certain columns using pandas.
Now my problem is that the header repeats itself every 50 rows or so, so when I extract the columns I get only the first 50 rows. How can I get the complete columns while skipping all the repeated headers but the first one?
This is the code I have so far, but it works nicely only until the second header:
import pandas
data = pandas.read_csv('data1.csv', usecols=['dbSNP RS ID', 'Physical Position'])
import sys
sys.stdout = open("data2.csv", "w")
print(data)
This is an example representing some rows of the extracted columns:
dbSNP RS ID Physical Position
0 rs4147951 66943738
1 rs2022235 14326088
2 rs6425720 31709555
3 rs12997193 106584554
4 rs9933410 82323721
...
48 rs5771794 49157118
49 rs1061497 1415331
50 rs12647065 136012580
dbSNP RS ID Physical Position
...
dbSNP RS ID Physical Position
...
and so on...
Thanks very much in advance!
You could read the file with header=None, drop the duplicate rows (drop_duplicates keeps the first occurrence by default, so the repeated headers collapse into one), and then promote the remaining first row to the header like so:
df = pd.read_csv(path, header=None).drop_duplicates()
df.columns = df.iloc[0]
df = df.iloc[1:]
Note this assumes no two genuine data rows are identical, since those would be dropped as well.
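An alternative sketch (assuming the repeated headers literally re-appear as data rows, so the dbSNP RS ID column contains the string 'dbSNP RS ID' on those rows) is to filter them out after a normal read:

```python
import pandas as pd
from io import StringIO

# Small stand-in for data1.csv with the header repeated mid-file
x = """dbSNP RS ID,Physical Position
rs4147951,66943738
rs2022235,14326088
dbSNP RS ID,Physical Position
rs6425720,31709555"""

df = pd.read_csv(StringIO(x))
df = df[df['dbSNP RS ID'] != 'dbSNP RS ID']  # drop repeated header rows
print(len(df))  # 3
```

This leaves genuine duplicate data rows untouched, at the cost of the columns coming back as strings (you can cast them afterward with astype).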