I have a tab-delimited file with data like:
id Name address dept sal
1 abc "bangalore,
Karnataka,
Inida" 10 500
2 xyz "Hyderabad
Inida" 20 500
Here the columns are id,Name,address,dept, and sal.
The issue is with address columns that can contain a new line character. I tried different methods to read the file using Pandas and Python but instead of two rows, I am getting multiple rows as output.
Here are the few commands I tried:
file1 = open('C:/dummy/dummy.csv', 'r')
lines = file1.readlines()
for i in lines:
print(i)
and
df = pd.read_csv("C:/dummy/dummy.csv",sep='\t',quotechar='"')
Can anyone please help?
df = pd.read_csv("C:/dummy/dummy.csv",sep='\t',quotechar='"')
The corresponding output is, in case the columns are tab-delimited in the csv-file, as you say
id Name address dept sal
0 1 abc bangalore,\r\nKarnataka,\r\nInida 10 500
1 2 xyz Hyderabad\r\nInida 20 500
If you like to remove the CR-LF within the string, you can remove them via post-processing.
Additionally you could define the index-column via
df = pd.read_csv("C:/dummy/dummy.csv",sep='\t',quotechar='"',index_col=0)
What is your desired/expected output?
Related
I have been trying to convert 3 csv files with related keys into a single JSON file using Python.
Originally, I had tried using SAS but noticed the proc required (I believe) all data to be available in a single row. I was unable to recreate an array containing multiple customers or warehouses against a single sale.
The challenge I am facing is the 1st csv is a unique set of data points there are no duplicates. The 2nd csv links back to the 1st via and this creates duplicate rows, this is the same for the 3rd csv.
The 3rd csv is a 1 to many relationship with the 1st and the 2nd csv has a 0 to 1 to many relationship with the 1st.
The format of the 3 csv files is as follows:
CSV 1 - single row for each unique ID
saleID
ProductName
1
A
2
B
CSV2 - can have duplicates 1 to many relationship with csv1
WarehouseID
saleID
WarehouseName
1
1
A
2
2
B
1
3
A
CSV3 - can have duplicates 1 to many relationship with csv1
customerID
saleID
CustomerName
1
1
Albert
2
2
Bob
3
1
Cath
The expected format of the JSON would be something like this.
{
"totalSales":2,
"Sales":[{
"saleId":1,
"productName":"A",
"warehouse":[{
"warehouseID":1,
"warehouseName":"A"
}],
"customer":[{
"customerID":1,
"customerName":"Albert"
},
"customerID":3,
"customerName":"Cath"
}],
"Sales":[{
"saleId":2,
"productName":"B",
"warehouse":[{
"warehouseID":2,
"warehouseName":"B"
}],
"customer":[{
"customerID":2,
"customerName":"Bob"
}]
}
What i've tried so far in python seems to have a similar result as what i achieved in SAS as i think im missing the step to capture the warehouse and customer information as an array.
def multicsvtojson():
salesdf = pandas.read_csv('C:\\Python\\multiCSVtoJSON\\sales.csv', names=("salesID","ProductName"))
warehousedf = pandas.read_csv('C:\\Python\\multiCSVtoJSON\\warehouse.csv', names=("warehouseID", "salesID", "warehouseName"))
customerdf = pandas.read_csv('C:\\Python\\multiCSVtoJSON\\customers.csv', names=("customerID", "salesID", "customerName"))
finaldf = pd.merge(pd.merge(salesdf, warehousedf, on='salesID'), customerdf, on='salesID')
finaldf.to_json('finalResult.json', orient='records')
print(finaldf)
results
[{"salesID":"saleID","ProductName":"productName","warehouseID":"warehouseID","warehouseName":"warehouseName","customerID":"customerID","customerName":"productName"},
{"salesID":"1","ProductName":"A","warehouseID":"1","warehouseName":"A","customerID":"1","customerName":"Albert"},
{"salesID":"1","ProductName":"A","warehouseID":"1","warehouseName":"A","customerID":"3","customerName":"Cath"},
{"salesID":"2","ProductName":"B","warehouseID":"2","warehouseName":"B","customerID":"2","customerName":"Bob"}]
I have extracted user_id against shop_ids as pandas dataframe from database using SQL query.
user_id shop_ids
0 022221205 541
1 023093087 5088,4460,4460,4460,4460,4460,4460,4460,5090
2 023096023 2053,2053,2053,2053,2053,2053,2053,2053,2053,1...
3 023096446 4339,4339,3966,4339,4339
4 023098684 5004,3604,5004,5749,5004
I am trying to write this dataframe into csv using:
df.to_csv('users_ordered_shops.csv')
I end up with the csv merging the shop ids into one number as such:
user_id shop_ids
0 22221205 541
1 23093087 508,844,604,460,446,000,000,000,000,000,000,000
2 23096023 2,053,205,320,532,050,000,000,000,000,000,000,000,000,000,000,000,000
3 23096446 43,394,339,396,643,300,000
4 23098684 50,043,604,500,457,400,000
The values for index 2 are:
print(df.iloc[2].shop_ids)
2053,2053,2053,2053,2053,2053,2053,2053,2053,1294,1294,2053,1922
Expected output is a csv file with all shop_ids intact in one column or different columns like:
user_id shop_ids
0 022221205 541
1 023093087 5088,4460,4460,4460,4460,4460,4460,4460,5090
2 023096023 2053,2053,2053,2053,2053,2053,2053,2053,2053,1294,1294,2053,1922
3 023096446 4339,4339,3966,4339,4339
4 023098684 5004,3604,5004,5749,5004
Any tips on how to get the shop ids without merging when writing to a csv file? I have tried converting the shop_ids column using astype() to int and str which has resulted in the same output.
Update
To get one shop per column (and remove duplicates), you can use:
pd.concat([df['user_id'],
df['shop_ids'].apply(lambda x: sorted(set(x.split(','))))
.apply(pd.Series)],
axis=1).to_csv('users_ordered_shops.csv', index=False)
Change the delimiter. Try:
df.to_csv('users_ordered_shops.csv', sep=';')
Or change the quoting strategy:
import csv
df.to_csv('users_ordered_shops.csv', quoting=csv.QUOTE_NONNUMERIC)
I have a wireless radio readout that basically dumps all of the data into one column (column 'A') a of a spreadsheet (.xlsx). Is there anyway to parse the twenty plus columns into a dataframe for pandas? This is example of the data that is in column A of the excel file:
DSP ALLMSINFO:SECTORID=0,CARRIERID=0;
Belgium351G
+++ HUAWEI 2020-04-03 10:04:47 DST
O&M #4421590
%%/*35687*/DSP ALLMSINFO:SECTORID=0,CARRIERID=0;%%
RETCODE = 0 Operation succeeded
Display Information of All MSs-
------------------------------
Sector ID Carrier ID MSID MSSTATUS MSPWR(dBm) DLCINR(dB) ULCINR(dB) DLRSSI(dBm) ULRSSI(dBm) DLFEC ULFEC DLREPETITIONFATCTOR ULREPETITIONFATCTOR DLMIMOFLAG BENUM NRTPSNUM RTPSNUM ERTPSNUM UGSNUM UL PER for an MS(0.001) NI Value of the Band Where an MS Is Located(dBm) DL Traffic Rate for an MS(byte/s) UL Traffic Rate for an MS(byte/s)
0 0 0011-4D10-FFBA Enter -2 29 27 -56 -107 21 20 0 0 MIMO B 2 0 0 0 0 0 -134 158000 46000
0 0 501F-F63B-FB3B Enter 13 27 28 -68 -107 21 20 0 0 MIMO A 2 0 0 0 0 0 -134 12 8
Basically I just want to parse this data and have the table in a dataframe. Any help would be greatly appreciated.
You could try pandas read excel
df = pd.read_excel(filename, skip_rows=9)
This assumes we want to ignore the first 9 rows that don't make up the dataframe! Docs here https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
Load the excel file and split the column on the spaces.
A problem may occur with "DLMIMOFLAG" because it has a space in the data and this will cause it to be split over two columns. It's optional whether this is acceptable or if the columns are merged back together afterwards.
Add the header manually rather than load it, otherwise all the spaces in the header will confuse the loading & splitting routines.
import numpy as np
import pandas as pd
# Start on the first data row - row 10
# Make sure pandas knows that only data is being loaded by using
# header=None
df = pd.read_excel('radio.xlsx', skiprows=10, header=None)
This gives a dataframe that is only data, all held in one column.
To split these out, make sure pandas has a reference to the first column with df.iloc[:,0], split the column based on spaces with str.split() and inform pandas the output will be a numpy list values.tolist().
Together this looks like:
df2 = pd.DataFrame(df.iloc[:,0].str.split().values.tolist())
Note the example given has an extra column because of the space in "DLMIMOFLAG" causing it to be split over two columns. This will be referred to as "DLMIMOFLAG_A" and "DLMIMOFLAG_B".
Now add on the column headers.
Optionally create a list first.
column_names = ["Sector ID", "Carrier ID", "MSID", "MSSTATUS", "MSPWR(dBm)", "DLCINR(dB)", "ULCINR(dB)",
"DLRSSI(dBm)", "ULRSSI(dBm)", "DLFEC", "ULFEC", "DLREPETITIONFATCTOR", "ULREPETITIONFATCTOR",
"DLMIMOFLAG_A", "DLMIMOFLAG_B", "BENUM", "NRTPSNUM", "RTPSNUM", "ERTPSNUM", "UGSNUM",
"UL PER for an MS(0.001)", "NI Value of the Band Where an MS Is Located(dBm)",
"DL Traffic Rate for an MS(byte/s)", "UL Traffic Rate for an MS(byte/s)",]
df2.columns = column_names
This gives the output as a full dataframe with column headers.
Sector ID Carrier ID MSID MSSTATUS
0 0 0011-4D10-FFBA Enter
0 0 501F-F63B-FB3B Enter
I need to upload multiple excel files - each one has a name of starting date. Eg. "20190114".
Then I need to append them in one DataFrame.
For this, I use the following code:
all_data = pd.DataFrame()
for f in glob.glob('C:\\path\\*.xlsx'):
df = pd.read_excel(f)
all_data = all_data.append(df,ignore_index=True)
In fact, I do not need all data, but filtered by multiple columns.
Then, I would like to create an additional column ('from') with values of file name (which is "date") for each respective file.
Example:
Data from the excel file, named '20190101'
Data from the excel file, named '20190115'
The final dataframe must have values in 'price' column not equal to '0' and in code column - with code='r' (I do not know if it's possible to export this data already filtered, avoiding exporting huge volume of data?) and then I need to add a column 'from' with the respective date coming from file's name:
like this:
dataframes for trial:
import pandas as pd
df1 = pd.DataFrame({'id':['id_1', 'id_2','id_3', 'id_4','id_5'],
'price':[0,12.5,17.5,24.5,7.5],
'code':['r','r','r','c','r'] })
df2 = pd.DataFrame({'id':['id_1', 'id_2','id_3', 'id_4','id_5'],
'price':[7.5,24.5,0,149.5,7.5],
'code':['r','r','r','c','r'] })
IIUC, you can filter necessary rows ,then concat, for file name you can use os.path.split() and access the filename with string slicing:
l=[]
for f in glob.glob('C:\\path\\*.xlsx'):
df=pd.read_excel(f)
df['from']=os.path.split(f)[1][:-5]
l.append(df[(df['code'].eq('r')&df['price'].ne(0))])
pd.concat(l,ignore_index=True)
id price code from
0 id_2 12.5 r 20190101
1 id_3 17.5 r 20190101
2 id_5 7.5 r 20190101
3 id_1 7.5 r 20190115
4 id_2 24.5 r 20190115
5 id_5 7.5 r 20190115
I am trying to get an index or row number for the row that holds the headers in my CSV file.
The issue is, the header row can move up and down depending on the output of the report from our system (I have no control to change this)
code:
ht = pd.read_csv(file.csv)
test = ht.get_loc('Code') #Code being header im using to locate the header row
csv1 = read_csv(file.csv, header=test)
df1 = df1.append(csv1) #Appending as have many files
If I was to print test, I would expect a number around 4 or 5, and that's what I am feeding into the second read "read_csv"
The error I'm getting is that it's expecting 1 header column, but I have 26 columns. I am just trying to use the first header string to get the row number
Thanks
:-)
Edit:
CSV format
This file contains the data around the volume of items blablalbla
the deadlines for delivery of items a - z is 5 days
the deadlines for delivery of items aa through zz are 3 days
the deadlines for delivery of items aaa through zzz are 1 days
code,type,arrived_date,est_del_date
a/wrwgwr12/001,kids,12-dec-18,17-dec-18
aa/gjghgj35/030,pet,15-dec-18,18-dec-18
as you will see the "The deadlines" rows are the same, this can be 3 or 5 based on the code ids, thus the header row can change up or down.
I also did not write out all 26 column headers, not sure that matters.
Wanted DF format
index | code | type | arrived_date | est_del_date
1 | a/wrwgwr12/001 | kids | 12-dec-18 | 17-dec-18
2 | aa/gjghgj35/030 | Pet | 15-dec-18 | 18-dec-18
Hope this makes sense..
Thanks,
You can use the csv module to find the first row which contains a delimiter, then feed the index of this row as the skiprows parameter to pd.read_csv:
from io import StringIO
import csv
import pandas as pd
x = """This file contains the data around the volume of items blablalbla
the deadlines for delivery of items a - z is 5 days
the deadlines for delivery of items aa through zz are 3 days
the deadlines for delivery of items aaa through zzz are 1 days
code,type,arrived_date,est_del_date
a/wrwgwr12/001,kids,12-dec-18,17-dec-18
aa/gjghgj35/030,pet,15-dec-18,18-dec-18"""
# replace StringIO(x) with open('file.csv', 'r')
with StringIO(x) as fin:
reader = csv.reader(fin)
idx = next(idx for idx, row in enumerate(reader) if len(row) > 1) # 4
# replace StringIO(x) with 'file.csv'
df = pd.read_csv(StringIO(x), skiprows=idx)
print(df)
code type arrived_date est_del_date
0 a/wrwgwr12/001 kids 12-dec-18 17-dec-18
1 aa/gjghgj35/030 pet 15-dec-18 18-dec-18