Import CSV file where last column has many separators [duplicate] - python

This question already has an answer here:
python pandas read_csv delimiter in column data
(1 answer)
Closed 2 years ago.
The dataset looks like this:
region,state,latitude,longitude,status
florida,FL,27.8333,-81.717,open,for,activity
georgia,GA,32.9866,-83.6487,open
hawaii,HI,21.1098,-157.5311,illegal,stuff
iowa,IA,42.0046,-93.214,medical,limited
As you can see, the last column sometimes has separators in it. This makes it hard to import the CSV file in pandas using read_csv(). The only way I can import the file is by adding the parameter error_bad_lines=False to the function. But this way I'm losing some of the data.
How can I import the CSV file without losing data?

I would read the file as one single column and parse manually:
df = pd.read_csv(filename, sep='\t')  # no tabs in the data, so each line loads as a single field
pat = ','.join([f'(?P<{x}>[^,]*)' for x in ['region', 'state', 'latitude', 'longitude']])
pat = '^' + pat + ',(?P<status>.*)$'
df = df.iloc[:, 0].str.extract(pat)
Output:
region state latitude longitude status
0 florida FL 27.8333 -81.717 open,for,activity
1 georgia GA 32.9866 -83.6487 open
2 hawaii HI 21.1098 -157.5311 illegal,stuff
3 iowa IA 42.0046 -93.214 medical,limited
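A stdlib alternative to the regex approach, sketched with the question's sample rows inlined (the in-memory string stands in for the real file): split each record with the csv module and re-join any extra fields back into the last column before building the DataFrame.

```python
import csv
import io

import pandas as pd

# Sample data matching the question (stands in for the real file)
raw = """region,state,latitude,longitude,status
florida,FL,27.8333,-81.717,open,for,activity
georgia,GA,32.9866,-83.6487,open
hawaii,HI,21.1098,-157.5311,illegal,stuff
iowa,IA,42.0046,-93.214,medical,limited
"""

rows = list(csv.reader(io.StringIO(raw)))
header, data = rows[0], rows[1:]
# Keep the first len(header)-1 fields; re-join the overflow into the last column
n = len(header) - 1
fixed = [r[:n] + [','.join(r[n:])] for r in data]
df = pd.DataFrame(fixed, columns=header)
print(df)
```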

Have you tried the old-school technique with the split function? A major downside is that you'd end up losing data or bumping into errors if your data has a , in any of the first 4 fields/columns, but if not, you could use it.
data = open(file, 'r').read().split('\n')
for line in data:
    items = line.split(',', 4)  # Assuming there are 4 standard columns, and the 5th column has commas
Each row items would look, for example, like this:
['hawaii', 'HI', '21.1098', '-157.5311', 'illegal,stuff']
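Putting the loop together into a DataFrame might look like this (a sketch with two of the question's rows inlined; `maxsplit=4` is what keeps everything after the fourth comma in one field):

```python
import pandas as pd

lines = [
    "florida,FL,27.8333,-81.717,open,for,activity",
    "georgia,GA,32.9866,-83.6487,open",
]
cols = ['region', 'state', 'latitude', 'longitude', 'status']
# maxsplit=4 -> at most 5 items; trailing commas stay inside the 5th field
records = [line.split(',', 4) for line in lines]
df = pd.DataFrame(records, columns=cols)
print(df['status'].tolist())  # ['open,for,activity', 'open']
```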

Related

How to split csv data

I have a problem where I got a csv data like this:
AgeGroup Where do you hear our company from? How long have you using our platform?
18-24 Word of mouth; Supermarket Product 0-1 years
36-50 Social Media; Word of mouth 1-2 years
18-24 Advertisement +4 years
and I tried to make the file into this format through either jupyter notebook or from excel csv:
AgeGroup Where do you hear our company from?
18-24 Word of mouth 0-1 years
18-24 Supermarket Product 0-1 years
36-50 Social Media 1-2 years
36-50 Word of mouth 1-2 years
18-24 Advertisement +4 years
Let's say the csv file is Untitled form.csv and I import the data to jupyter notebook:
data = pd.read_csv('Untitled form.csv')
Can anyone tell me how should I do it?
I have tried doing it in Excel with the data-to-column tool, but of course that only separates the data into columns, while what I want is the data separated into rows while still keeping the values from the other columns.
Anyway, I found another, more roundabout way to do it. First I edit the file through PowerSource in Excel and save it to a different file, and if a utf-8 encoding error appears, I just add encoding='cp1252'.
So it would become like this:
import pandas as pd
data_split = pd.read_csv('Untitled form split.csv',
                         skipinitialspace=True,
                         usecols=range(1, 7),
                         encoding='cp1252')
However if there's a more efficient way, please let me know. Thanks
I'm not 100% sure about your question since I think it might be two separate issues but hopefully this should fix it.
import pandas as pd

data = pd.read_fwf('Untitled form.csv')
cols = data.columns
data_long = pd.DataFrame(columns=data.columns)
for idx, row in data.iterrows():
    hear_from = row['Where do you hear our company from?'].split(';')
    hear_from_fmt = list(map(lambda x: x.strip(), hear_from))
    n_items = len(hear_from_fmt)
    d = {
        cols[0]: [row[0]] * n_items,
        cols[1]: hear_from_fmt,
        cols[2]: [row[2]] * n_items,
    }
    data_long = pd.concat([data_long, pd.DataFrame(d)], ignore_index=True)
Let's break it down.
The line data = pd.read_fwf('Untitled form.csv') reads the file by inferring the spacing between columns. This is only useful because I am not sure your file is a proper CSV; if it is, you can open it normally with read_csv, and if not, this might help.
Now for the rest: we iterate through each row and select the methods someone could have heard about your company from. These are split on ';' and then stripped to remove stray spaces. A new temporary dataframe is created where the first and last columns repeat the row's values, with as many rows as there are elements in the hear_from_fmt list. The dataframes are then concatenated together.
Now there might be a more efficient solution, but this should work.
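A more vectorized sketch of the same idea (column names copied from the question; the inline frame stands in for the real file): split the multi-answer column on ';' and use DataFrame.explode to turn each list element into its own row, repeating the other columns automatically.

```python
import pandas as pd

# Stand-in for the question's survey data
df = pd.DataFrame({
    'AgeGroup': ['18-24', '36-50', '18-24'],
    'Where do you hear our company from?': [
        'Word of mouth; Supermarket Product',
        'Social Media; Word of mouth',
        'Advertisement',
    ],
    'How long have you using our platform?': ['0-1 years', '1-2 years', '+4 years'],
})

col = 'Where do you hear our company from?'
df[col] = df[col].str.split(';')          # each cell becomes a list of answers
long_df = df.explode(col, ignore_index=True)  # one row per answer
long_df[col] = long_df[col].str.strip()   # drop the spaces left around ';'
print(long_df)
```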

How to extract only year(YYYY) from a CSV column with data like YYYY-YY

I am new to Python/Bokeh/Pandas.
I am able to plot line graph in pandas/bokeh using parse_date options.
However, I have come across a dataset (.csv) where the 'Year/Ports' column is in YYYY-YY form, like 1952-53, 1953-54, 1954-55, etc.
My code is below; it gives a blank graph when the column is in that form.
Do I have to extract only the YYYY and plot? That works, but I am sure that is not how the data is meant to be visualized.
If I extract only the YYYY using CSV or Notepad++ tools, there is no issue: the dates are read perfectly and I get a good, meaningful line graph.
#Total Cargo Handled at Mormugao Port from 1950-51 to 2019-20
import pandas as pd
from bokeh.plotting import figure,show
from bokeh.io import output_file
#read the CSV file shared by GOI
df = pd.read_csv("Cargo_Data_full.csv",parse_dates=["Year/Ports"])
# selecting rows based on condition
output_file("Cargo tracker.html")
f = figure(height=200,sizing_mode = 'scale_width',x_axis_type = 'datetime')
f.title.text = "Cargo Tracker"
f.xaxis.axis_label="Year/Ports"
f.yaxis.axis_label="Cargo handled"
f.line(df['Year/Ports'],df['OTHERS'])
show(f)
You can't use parse_dates in this case, since the format is not a valid datetime. You can use pandas string slicing to only keep the YYYY part.
df = pd.DataFrame({'Year/Ports':['1952-53', '1953-54', '1954-55'], 'val':[1,2,3]})
df['Year/Ports'] = df['Year/Ports'].str[:4]
print(df)
Year/Ports val
0 1952 1
1 1953 2
2 1954 3
From there you can turn it into a datetime if that makes sense for you.
df['Year/Ports'] = pd.to_datetime(df['Year/Ports'])
print(df)
Year/Ports val
0 1952-01-01 1
1 1953-01-01 2
2 1954-01-01 3

I need help formating this data

I have data like this
id,phonenumbers,firstname,lastname,email,birthday,gender,locale,hometown,location,link
The problem is some data is not in the format like this
000000,000000,name1,name2,email#email,1 1 1990,female,en_En,new york,USA ,new yourk,https://www.example.com
As you can see, between "locale" and "hometown" there are 3 commas; I want to delete one of them so the data becomes like this:
000000,000000,name1,name2,email#email,1 1 1990,female,en_En ,new york USA, new yourk,https://www.example.com
This is just an example; in my data there could be more than 3 commas and different addresses.
Essentially I want to load the data into Excel and have it show up cleanly, each column with the right data.
The problem is that a value is split into multiple columns when it should be in one column. If this affects only one column, and there is a fixed number of columns before and after it, then it's possible to fix:
testdata = "000000,000000,name1,name2,email#email,1 1 1990,female,en_En,new york,USA ,new yourk,https://www.example.com"
def split(data, cols_before_addr=8, cols_after_addr=1):
    raw_cols = data.split(',')
    return raw_cols[:cols_before_addr] \
        + ["\n".join(raw_cols[cols_before_addr:-cols_after_addr])] \
        + raw_cols[-cols_after_addr:]

print(split(testdata))

Pandas Check Multiple Conditions [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 3 years ago.
I have a small Excel file that contains prices for our online store, and I am trying to automate this process. However, I don't fully trust the stuff to properly qualify the data, so I wanted to use Pandas to quickly check over certain fields. I have managed to achieve everything I need so far, but I am only a beginner and I cannot think of the proper way to do the next part.
So basically I need to qualify 2 columns on the same row, we have one column MARGIN, if this column is >60, then I need to check that the MARKDOWN column on the same row is populated == YES.
So my question is, how can I code it to basically say: if MARGIN > 60, check that MARKDOWN == YES on the same row?
Below is an example of the way I have been doing my other checks, I realise it is quite beginner-ish, but I am only a beginner.
sku2 = df['SKU_2']
comp_at = df['COMPARE AT PRICE']
sales_price = df['SALES PRICE']
dni_act = df['DO NOT IMPORT - action']
dni_fur = df['DO NOT IMPORT - further details']
promo = df['PROMO']
replacement = df['REPLACEMENT']
go_live_date = df['go live date']
markdown = df['markdown']
# sales price not blank check
for item in sales_price:
    if pd.isna(item):
        # the with-block closes the file automatically (the original fd.close
        # was missing its parentheses, so it never actually ran)
        with open('document.csv', 'a', newline="") as fd:
            writer = csv.writer(fd)
            writer.writerow(['It seems there is a blank sales price in here', str(file_name)])
        break
Example:
df = pd.DataFrame([
    ['a', 1, 2],
    ['b', 3, 4],
    ['a', 5, 6]],
    columns=['f1', 'f2', 'f3'])

# & represents AND; | represents OR
print(df[(df['f1'] == 'a') & (df['f2'] > 1)])
Output:
f1 f2 f3
2 a 5 6
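Applied to the question's own columns (MARGIN and MARKDOWN, with toy values made up for illustration), the rows that break the rule can be pulled out in one expression:

```python
import pandas as pd

# Hypothetical price sheet with the question's two columns
df = pd.DataFrame({
    'MARGIN': [65, 50, 70],
    'MARKDOWN': ['YES', 'NO', 'NO'],
})

# Rule: MARGIN > 60 must imply MARKDOWN == 'YES'; select the violations
bad = df[(df['MARGIN'] > 60) & (df['MARKDOWN'] != 'YES')]
print(bad)  # only the row with MARGIN 70 and MARKDOWN 'NO'
```

If `bad` is empty, every row passes the check; otherwise it lists exactly the rows to report.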

could not convert string to float: '7,751.30' [duplicate]

This question already has answers here:
pandas reading CSV data formatted with comma for thousands separator
(3 answers)
Closed 5 years ago.
I get the TWSE price from Taiwan Stock Exchange.
df = pd.read_csv(r'C:\Stock\TWSE.csv',encoding='Big5')
df.head()
日期 開盤指數 最高指數 最低指數 收盤指數
0 96/02/01 7,751.30 7,757.63 7,679.78 7,701.54
1 96/02/02 7,754.16 7,801.63 7,751.53 7,777.03
2 96/02/05 7,786.77 7,823.94 7,772.05 7,783.12
3 96/02/06 7,816.30 7,875.75 7,802.94 7,875.75
4 96/02/07 7,894.77 7,894.77 7,850.06 7,850.06
df.loc[0][2]
'7,757.63'
type(df.loc[0][2])
str
I want to convert the str type to float type for the purpose of plotting.
But, I can not convert them. For example:
float(df.loc[0][2])
ValueError: could not convert string to float: '7,757.63'
pd.read_csv, much like almost every other pd.read_* function, has a thousands parameter you can set to ',' to make sure that you're importing those values as floats.
The following is an illustration:
import io
import pandas as pd
txt = '日期 開盤指數 最高指數 最低指數 收盤指數\n0 96/02/01 7,751.30 7,757.63 7,679.78 7,701.54\n1 96/02/02 7,754.16 7,801.63 7,751.53 7,777.03\n2 96/02/05 7,786.77 7,823.94 7,772.05 7,783.12\n3 96/02/06 7,816.30 7,875.75 7,802.94 7,875.75\n4 96/02/07 7,894.77 7,894.77 7,850.06 7,850.06'
with io.StringIO(txt) as f:
    df = pd.read_table(f, encoding='utf8', header=0, thousands=',', sep=r'\s+')
print(df)
Yields:
日期 開盤指數 最高指數 最低指數 收盤指數
0 96/02/01 7751.30 7757.63 7679.78 7701.54
1 96/02/02 7754.16 7801.63 7751.53 7777.03
2 96/02/05 7786.77 7823.94 7772.05 7783.12
3 96/02/06 7816.30 7875.75 7802.94 7875.75
4 96/02/07 7894.77 7894.77 7850.06 7850.06
I hope this proves helpful.
float(df.loc[0][2].replace(',',''))
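The same fix applied to a whole column at once (a small Series with two of the question's values stands in for the real data): strip the thousands separators vectorized, then cast to float.

```python
import pandas as pd

s = pd.Series(['7,751.30', '7,757.63'])
vals = s.str.replace(',', '', regex=False).astype(float)
print(vals.tolist())  # [7751.3, 7757.63]
```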
