I have the following sample csv file:
'TEXT';'DATE'
'hello';'20/02/2002'
'hello!
how are you?';'21/02/2002'
So, as you can see, the separator between columns is ; and the content of each column is delimited by '. This causes problems when processing the file with pandas, because it uses line breaks as the delimiter between rows. That is, it interprets the line break between "hello!" and "how are you?" as a separator between rows.
So what I would need is to remove the newlines within the content of each column, so that the file looks like this:
'TEXT';'DATE'
'hello';'20/02/2002'
'hello! how are you?';'21/02/2002'
Removing every \r\n sequence would not work, because then I would lose the row separation.
What can I try? I'm using Teradata SQL Assistant to generate the csv file.
You can use sep= and quotechar= parameters in pd.read_csv:
import pandas as pd
df = pd.read_csv('your_file.csv', sep=';', quotechar="'")
print(df)
Prints:
                     TEXT        DATE
0                   hello  20/02/2002
1  hello!\r\nhow are you?  21/02/2002
If you want to further replace the newlines:
df['TEXT'] = df['TEXT'].str.replace('\r', '').str.replace('\n', ' ')
print(df)
Prints:
                  TEXT        DATE
0                hello  20/02/2002
1  hello! how are you?  21/02/2002
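The whole round trip can be sketched end to end as below. This is a minimal sketch, not the answerer's exact code: the in-memory string `raw` stands in for the real file, a single regex replaces the two chained `str.replace` calls, and writing back with `to_csv` and `csv.QUOTE_ALL` is an assumption about the desired output format.

```python
import csv
import io

import pandas as pd

# Hypothetical in-memory copy of the sample file, including the embedded line break.
raw = "'TEXT';'DATE'\n'hello';'20/02/2002'\n'hello!\nhow are you?';'21/02/2002'\n"

# quotechar="'" tells the parser that a line break inside '...' belongs to the cell.
df = pd.read_csv(io.StringIO(raw), sep=";", quotechar="'")

# Collapse any \r\n or \n inside a cell into a single space.
df["TEXT"] = df["TEXT"].str.replace(r"\r?\n", " ", regex=True)

# Write the cleaned data back with the same separator and quote character.
cleaned = df.to_csv(sep=";", quotechar="'", quoting=csv.QUOTE_ALL, index=False)
print(cleaned)
```

Passing a file path instead of `io.StringIO(raw)` works the same way.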
Using Pandas, I'm trying to extract value using the key but I keep failing to do so. Could you help me with this?
There's a csv file like below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file in Pandas and made a DataFrame out of it:
[Image: dataframe from a csv file]
However, when I tried to extract the value using a key (e.g. df["id"]), I'm facing an error message.
I'd like to see a value 1234 or 5678 using df["id"]. Which step should I take to get it done? This may be a very basic question but I need your help. Thanks.
The csv file isn't being read in correctly.
Each line of the file is wrapped in double quotes, so pandas treats the whole line, commas included, as a single quoted field instead of splitting it at the delimiter. See the read_csv documentation for more on this. Because of this, the pandas dataframe has a single column, value, whose cells each hold an entire line from your file - the first entry is {"id":"1234","currency":"USD"}. So, the dataframe doesn't have a column id, and you can't select data by id.
The data aren't formatted as a pandas df, with row titles and columns of data. One option to read in this data is to manually process each row, though there may be slicker options.
file = 'test.dat'
id_vals = []
currency = []
with open(file, 'r') as f:
    # skip the header line ("value")
    for line in f.readlines()[1:]:
        ## remove obfuscating characters
        for c in '"{}\n':
            line = line.replace(c, '')
        line = line.split(',')
        ## extract values to two lists
        id_vals.append(line[0][3:])      # drop the leading "id:"
        currency.append(line[1][9:])     # drop the leading "currency:"
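A variant of the same manual loop that does not depend on character positions might look like the sketch below. It uses only the standard library: `csv.reader` undoes the outer quoting and `json.loads` parses what is left. The string `raw` is a hypothetical stand-in for the file.

```python
import csv
import io
import json

# Hypothetical in-memory copy of the file from the question.
raw = 'value\n"{""id"":""1234"",""currency"":""USD""}"\n"{""id"":""5678"",""currency"":""EUR""}"\n'

reader = csv.reader(io.StringIO(raw))
header = next(reader)  # the single column name

id_vals = []
currency = []
for row in reader:
    # Each row holds exactly one field: a JSON object as text.
    record = json.loads(row[0])
    id_vals.append(record["id"])
    currency.append(record["currency"])

print(header, id_vals, currency)
```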
You just need to clean up the CSV file a little and you are good. Here is every step:
import re
import pandas as pd

# open your csv and read as a text string
with open('My_CSV.csv', 'r') as f:
    my_csv_text = f.read()
# remove problematic strings
find_str = ['{', '}', '"', 'id:', 'currency:', 'value']
replace_str = ''
for i in find_str:
    # re.escape makes each string match literally
    my_csv_text = re.sub(re.escape(i), replace_str, my_csv_text)
# Create new csv file and save cleaned text
new_csv_path = './my_new_csv.csv' # or whatever path and name you want
with open(new_csv_path, 'w') as f:
    f.write(my_csv_text)
# Create pandas dataframe
df = pd.read_csv(new_csv_path, sep=',', names=['ID', 'Currency'])
print(df)
Output df:
     ID Currency
0  1234      USD
1  5678      EUR
You need to parse each row of your dataframe using json.loads() or eval(), something like this:
import json
for row in df.itertuples():
    print(json.loads(row.value)["id"])
    # OR (eval also works here, but it is unsafe on untrusted input)
    print(eval(row.value)["id"])
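To get a dataframe where `df["id"]` works directly, the parsed rows can be collected into a new frame. The following is a minimal sketch under the assumption that every cell in the `value` column is valid JSON with the same keys. The names `raw` and `records` are illustrative.

```python
import io
import json

import pandas as pd

# Hypothetical in-memory copy of the file from the question.
raw = 'value\n"{""id"":""1234"",""currency"":""USD""}"\n"{""id"":""5678"",""currency"":""EUR""}"\n'

df = pd.read_csv(io.StringIO(raw))

# Parse each JSON cell into a dict, then build one column per key.
records = pd.DataFrame([json.loads(cell) for cell in df["value"]])

print(records["id"].tolist())
```

The ids come back as strings because they are quoted in the JSON. Converting them with `records["id"].astype(int)` is an option if numbers are needed.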
I have a long text where I have inserted a delimiter ";" exactly where I would like to split the text into different columns.
So far, whenever I try to split the text into 'ID' and 'ADText' I only get the first line. However there should be 1439 lines/rows in two columns.
My text looks like this:
1234; text in written from with multiple sentences going over multiple lines until at some point the next ID is written dwon 2345; then the new Ad-Text begins until the next ID 3456; and so on
I want to use the ; to split my text into two Columns, one with ID and one with the AD Text.
#read the text file into python:
jobads= pd.read_csv("jobads.txt", header=None)
print(jobads)
#create dataframe
df=pd.DataFrame(jobads, index=None, columns=None)
type(df)
print(df)
#name column to target it for split
df = df.rename(columns={0:"Job"})
print(df)
#split it into two columns. Problem: I only get the first row.
print(pd.DataFrame(df.Job.str.split(';', n=1).tolist(),
                   columns=['ID','AD']))
Unfortunately that only works for the first entry and then it stops. The output looks like this:
ID AD
0 1234 text in written from with ...
Where am I going wrong? I would appreciate any advice =)
Thank you!
sample text:
FullName;ISO3;ISO1;molecular_weight
Alanine;Ala;A;89.09
Arginine;Arg;R;174.20
Asparagine;Asn;N;132.12
Aspartic_Acid;Asp;D;133.10
Cysteine;Cys;C;121.16
Create columns based on ";" separator:
import pandas as pd
f = "aminoacids"
df = pd.read_csv(f,sep=";")
EDIT: Considering the comment I assume the text looks more something like this:
t = """1234; text in written from with multiple sentences going over multiple lines until at some point the next ID is written dwon 2345; then the new Ad-Text begins until the next ID 3456; and so on1234; text in written from with multiple """
In this case a regex like this will split your string into IDs and texts, which you can then use to generate a pandas dataframe.
import re
r = re.compile("([0-9]+);")
re.split(r,t)
Output:
['',
'1234',
' text in written from with multiple sentences going over multiple lines until at some point the next ID is written dwon ',
'2345',
' then the new Ad-Text begins until the next ID ',
'3456',
' and so on',
'1234',
' text in written from with multiple ']
EDIT 2:
This is a response to the questioner's additional question in the comments:
How to convert this string to a pandas dataframe with 2 columns, IDs and Texts:
import pandas as pd
# a is the output list from the previous part of this answer, i.e. a = re.split(r, t)
# Create list of texts. ::2 takes every other item from a list, starting with the FIRST one;
# the trailing [1:] then drops the empty string that precedes the first ID.
texts = a[::2][1:]
print(texts)
# Create list of IDs. 1::2 takes every other item from a list, starting with the SECOND one
ids = a[1::2]
print(ids)
df = pd.DataFrame({"IDs":ids,"Texts":texts})
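Put together, the split-and-pair steps can be sketched as follows. The sample string `t` is a shortened, hypothetical version of the ad text, and the approach assumes that a run of digits followed by `;` only ever marks a new ID (a number followed by a semicolon inside an ad would be mistaken for one).

```python
import re

# Hypothetical, shortened sample: IDs followed by ";" and free text spanning lines.
t = "1234; first ad text spanning\nseveral lines 2345; second ad text 3456; and so on"

# The capturing group keeps the IDs in the result list.
parts = re.split(r"(\d+);", t)

# parts[0] is whatever precedes the first ID (an empty string here).
ids = parts[1::2]
texts = [s.strip() for s in parts[2::2]]

rows = list(zip(ids, texts))
print(rows)
```

From here, `pd.DataFrame(rows, columns=["ID", "AD"])` would give the two-column dataframe the question asks for.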
I have a csv file with a lot of columns.
After reading the file, printing the columns gives all the column names as one single string instead of separate column names.
I need separate column names. Can you please help me with how to do this?
code:
df1 = pd.read_csv("D:/Users/SPate233/Downloads/iMedical/AMPIL_DEV/MDM_PRODUCT_VIEW/MDM_PRODUCT_VIEW_H.csv", sep = '|')
print(list(df1.columns))
print(df1['SERIES_ID'][2])
Output:
['RECORD_ID,MDM_ID,SERIES_ID,RELTIO_ID,COUNTRY_ID,PRODUCT_NAME,GROUP_TYPE,JANSSEN_MSTR_PRDCT_NM']
KeyError: SERIES_ID
Desired Output:
['RECORD_ID','MDM_ID','SERIES_ID','RELTIO_ID','COUNTRY_ID','PRODUCT_NAME','GROUP_TYPE','JANSSEN_MSTR_PRDCT_NM']
Looks like you entered the wrong separator, so it's reading the entire first line as a single column. Try:
df1 = pd.read_csv("D:/Users/SPate233/Downloads/iMedical/AMPIL_DEV/MDM_PRODUCT_VIEW/MDM_PRODUCT_VIEW_H.csv", sep = ',')
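When the separator of a file is not known in advance, one rough way to check it is to count candidate delimiters in the header line. The helper `guess_delimiter` below is hypothetical (not part of pandas) and only a heuristic: it assumes the header contains no quoted fields and that the real delimiter is the most frequent candidate.

```python
def guess_delimiter(header_line, candidates=(",", ";", "|", "\t")):
    """Return the candidate that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


# Header line as printed in the question.
header = "RECORD_ID,MDM_ID,SERIES_ID,RELTIO_ID,COUNTRY_ID,PRODUCT_NAME,GROUP_TYPE,JANSSEN_MSTR_PRDCT_NM"

delimiter = guess_delimiter(header)
columns = header.split(delimiter)

print(repr(delimiter), columns)
```

The detected value can then be passed on as `pd.read_csv(path, sep=delimiter)`.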
I am having an issue reading a csv file with pandas, as the data are within quotes and contain extra whitespace.
The header row in csv file is "Serial No,First Name,Last Name,Country".
Example data of each row is "1 ,""David, T "",""Barnes "",""USA """.
Below is the code I have tried so far to remove the quotes and read the text within the doubled quotes.
import pandas as pd
import csv
df = pd.read_csv('file1.csv', sep=',', encoding='ansi', quotechar='"', quoting=csv.QUOTE_NONNUMERIC, doublequote=True, engine="python")
Is there a way to pre-process the file so that the result is as follows?
Serial No, First Name, Last Name, Country
1, David,T, Barnes, USA
Try using this.
file1 = pd.read_csv('sample.txt', sep=r',\s+', skipinitialspace=True, quoting=csv.QUOTE_ALL, engine='python')
Closing this as I am using EditPad to replace the commas and remove the quotes as a workaround.
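If pre-processing in Python is preferred over a text editor, one possible approach is to parse each line twice with the standard csv module: once to strip the outer quotes, and once to split the inner fields. This is a sketch under the assumption that every line looks exactly like the two examples in the question, i.e. the whole line is wrapped in one pair of quotes with the inner quotes doubled. The helper name `unwrap` is hypothetical.

```python
import csv

# The header and example row exactly as quoted in the question.
lines = [
    '"Serial No,First Name,Last Name,Country"',
    '"1 ,""David, T "",""Barnes "",""USA """',
]


def unwrap(line):
    """Parse a line that is itself one quoted CSV field containing a CSV row."""
    outer = next(csv.reader([line]))   # first pass: one field holding the inner row
    inner = next(csv.reader(outer))    # second pass: the actual fields
    return [field.strip() for field in inner]


rows = [unwrap(line) for line in lines]
print(rows)
```

The comma inside "David, T" survives because it is still protected by quotes during the second pass.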
I have exported a comma separated value file from a MSQL database (rpt-file ending). It only has two columns and 8 rows. Looking at the file in notepad everything looks OK. I tried to load the data into a pandas data frame using the code below:
import pandas as pd
with open('file.csv', 'r') as csvfile:
    df_data = pd.read_csv(csvfile, sep=',' , encoding = 'utf-8')
print(df_data)
When printing to the console, the first column header name is wrong: it has some extra characters at the start. I get no errors, but obviously the first column is decoded wrongly in my code:
[Image of output]
Anyone have any ideas on how to get this right?
Here's one possible option: Fix those headers after loading them in:
df.columns = [x.encode('utf-8').decode('ascii', 'ignore') for x in df.columns]
The str.encode followed by the str.decode call will drop those special characters, leaving only the ones in ASCII range behind:
>>> 'aSA'.encode('utf-8').decode('ascii', 'ignore')
'aSA'
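Note that the encode/decode approach above drops every non-ASCII character in the header, not only the stray ones. If the extra characters are a UTF-8 byte order mark (an assumption, since the characters are not shown in the question, but a common result of exports from Windows tools), decoding with the `utf-8-sig` codec removes just the mark. A minimal sketch with hypothetical file contents:

```python
# Hypothetical file bytes: a UTF-8 byte order mark followed by a two-column CSV.
data = "\ufeffcol1,col2\n1,2\n".encode("utf-8")

# Plain utf-8 keeps the mark, so it ends up glued to the first header name.
plain = data.decode("utf-8")
header_plain = plain.splitlines()[0].split(",")

# utf-8-sig strips a leading byte order mark if one is present.
clean = data.decode("utf-8-sig")
header_clean = clean.splitlines()[0].split(",")

print(header_plain, header_clean)
```

With pandas the equivalent would be `pd.read_csv('file.csv', encoding='utf-8-sig')`, passing the path directly instead of an already opened file object.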