Prevent pandas from removing spaces in numbers in text columns - Python

I'm trying to load a CSV file into a pandas DataFrame. The CSV is semicolon-delimited, and values in text columns are wrapped in double quotation marks.
File in question: https://www.dropbox.com/s/1xv391gebjzmmco/file_01.csv?dl=0
In one of the text columns ('TYTUL') I have the following value:
"00 307 1457 212"
I specify the column as str, but when I print or export the results to Excel I get
003071457212
instead of
00 307 1457 212
How do I prevent pandas from removing spaces?
Here is my code:
import pandas
df = pandas.read_csv(r'file_01.csv'
,sep = ';'
,quotechar = '"'
,names = ['DATA_OPERACJI'
,'DATA_KSIEGOWANIA'
,'OPIS_OPERACJI'
,'TYTUL'
,'NADAWCA_ODBIORCA'
,'NUMER_KONTA'
,'KWOTA'
,'SALDO_PO_OPERACJI'
,'KOLUMNA_9']
,usecols = [0,1,2,3,4,5,6,7]
,skiprows = 38
,skipfooter = 3
,encoding = 'cp1250'
,thousands = ' '
,decimal = ','
,parse_dates = [0,1]
,converters = {'OPIS_OPERACJI': str
,'TYTUL': str
,'NADAWCA_ODBIORCA': str
,'NUMER_KONTA': str}
,engine = 'python'
)
df.TYTUL.replace([' +', '^ +', ' +$'], [' ', '', ''],regex=True,inplace=True) #this only removes excessive spaces
print(df.TYTUL)
I also came up with a workaround (the lines marked #workaround below), but I would like to ask if there is a better way.
import pandas
df = pandas.read_csv(r'file_01.csv'
,sep = ';'
,quotechar = '?' #workaround
,names = ['DATA_OPERACJI'
,'DATA_KSIEGOWANIA'
,'OPIS_OPERACJI'
,'TYTUL'
,'NADAWCA_ODBIORCA'
,'NUMER_KONTA'
,'KWOTA'
,'SALDO_PO_OPERACJI'
,'KOLUMNA_9']
,usecols = [0,1,2,3,4,5,6,7]
,skiprows = 38
,skipfooter = 3
,encoding = 'cp1250'
,thousands = ' '
,decimal = ','
,parse_dates = [0,1]
,converters = {'OPIS_OPERACJI': str
,'TYTUL': str
,'NADAWCA_ODBIORCA': str
,'NUMER_KONTA': str}
,engine = 'python'
)
df.TYTUL.replace([' +', '^ +', ' +$'], [' ', '', ''],regex=True,inplace=True) #this only removes excessive spaces
df.TYTUL.replace(['^"', '"$'], ['', ''],regex=True,inplace=True) #workaround
print(df.TYTUL)

Remove this line from your read_csv code:
,thousands = ' '
With thousands = ' ', the parser treats spaces as thousands separators and strips them from fields that then look numeric, even when the column is converted to str. I tested it; the output is correct without this option:
'00 307 1457 212'
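For reference, here is a minimal sketch of the corrected call: the same arguments as in the question, just without thousands = ' ' (untested against the original file; if KWOTA or SALDO_PO_OPERACJI use spaces as thousands separators, they would then need a separate clean-up step):
import pandas
df = pandas.read_csv(r'file_01.csv'
    ,sep = ';'
    ,quotechar = '"'
    ,names = ['DATA_OPERACJI', 'DATA_KSIEGOWANIA', 'OPIS_OPERACJI', 'TYTUL',
              'NADAWCA_ODBIORCA', 'NUMER_KONTA', 'KWOTA', 'SALDO_PO_OPERACJI', 'KOLUMNA_9']
    ,usecols = [0,1,2,3,4,5,6,7]
    ,skiprows = 38
    ,skipfooter = 3
    ,encoding = 'cp1250'
    ,decimal = ','  # decimal comma is kept; only thousands = ' ' is dropped
    ,parse_dates = [0,1]
    ,converters = {'OPIS_OPERACJI': str, 'TYTUL': str, 'NADAWCA_ODBIORCA': str, 'NUMER_KONTA': str}
    ,engine = 'python'
    )
print(df.TYTUL)  # values like "00 307 1457 212" keep their spaces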

Related

Parse list to get new list with same structure

I applied some earlier code to a log to get the following list:
log = ['',
'',
'ABC KLSC: XYZ',
'',
'some text',
'some text',
'%%ABC KLSC: XYZ',
'some text',
'',
'ID = 5',
'TME = KRE',
'DDFFLE = SOFYU',
'QWWRTYA = GRRZNY',
'',
'some text',
'-----------------------------------------------',
'',
'QUWERW WALS RUSZ CRORS ELME',
'P <NULL> R 98028',
'P <NULL> R 30310',
'',
'',
'Some text',
'',
'Some text',
'',
'--- FINISH'
]
and I want to filter those lines to get a list containing only the lines that contain "=" and the lines that are laid out in column format (those below the headers QUWERW, WALS, RUSZ, CRORS). Additionally, for the column-format lines, I want to store each value with its corresponding header.
I was able to filter the desired lines with the code below (I'm not sure if there is a better condition for picking out the column lines):
d1 = [line for line in log if len(line) > 50 or " = " in line]
d1
>>
[
'ID = 5',
'TME = KRE',
'DDFFLE = SOFYU',
'QWWRTYA = GRRZNY',
'QUWERW WALS RUSZ CRORS ELME',
'P <NULL> R 98028',
'P <NULL> R 30310',
]
But I don't know how to get the output I'm looking for, shown below. Thanks for any help.
[
'ID = 5',
'TME = KRE',
'DDFFLE = SOFYU',
'QWWRTYA = GRRZNY',
'QUWERW = P',
'WALS = <NULL>',
'RUSZ = R',
'CRORS = 98028',
'QUWERW = P',
'WALS = <NULL>',
'RUSZ = R',
'CRORS = 30310'
]
Finding the = is straightforward. One way to find the column values, shown below, is to identify the header row containing the headings and then zip each following row against it after splitting on whitespace.
items_list = []
for item in log:
    if '=' in item:
        items_list.append(item)
    elif len(item.split()) > 3:
        splits = item.split()
        if all(header in splits for header in ['QUWERW', 'WALS', 'RUSZ', 'CRORS']):
            headers = splits
        else:
            for lhs, rhs in zip(headers, splits):
                items_list.append(f'{lhs} = {rhs}')
print('\n'.join(items_list))
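If I've traced the code correctly, running it against the log above prints exactly the pairs in the desired output, from ID = 5 down to CRORS = 30310.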

Getting rid of white space between name, number and height

I have a txt file like this:
name lastname 17 189cm
How do I get it to be like this?
name lastname, 17, 189cm
Using str.strip and str.split:
>>> my_string = 'name lastname 17 189cm'
>>> s = list(map(str.strip, my_string.split()))
>>> ', '.join([' '.join(s[:2]), *s[2:] ])
'name lastname, 17, 189cm'
You can use regex to replace multiple spaces (or tabs) with a comma:
import re
text = 'name lastname 17 189cm'
re.sub(r'\s\s+|\t', ', ', text)
text = 'name lastname 17 189cm'
out = ', '.join(text.rsplit(maxsplit=2)) # if sep is not provided then any consecutive whitespace is a separator
print(out) # name lastname, 17, 189cm
You could use re.sub:
import re
s = "name lastname 17 189cm"
re.sub("[ ]{2,}",", ", s)
PS: for the first problem you proposed, I had the following solution:
s = "name lastname 17 189cm"
s[::-1].replace(" ",",", 2)[::-1]
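If the goal is to convert the whole txt file rather than a single string, here is a minimal sketch built on the rsplit approach above (the filename data.txt and the one-record-per-line layout are my assumptions):
import pathlib
lines = pathlib.Path('data.txt').read_text(encoding='utf-8').splitlines()  # hypothetical filename
converted = [', '.join(line.rsplit(maxsplit=2)) for line in lines if line.strip()]
print('\n'.join(converted))  # name lastname, 17, 189cm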

Convert a list of tab prefixed strings to a dictionary

A text mining attempt here: I would like to turn the list below:
a=['Colors.of.the universe:\n',
' Black: 111\n',
' Grey: 222\n',
' White: 11\n',
'Movies of the week:\n',
' Mission Impossible: 121\n',
' Die_Hard: 123\n',
' Jurassic Park: 33\n',
'Lands.categories.said:\n',
' Desert: 33212\n',
' forest: 4532\n',
' grassland : 431\n',
' tundra : 243451\n']
to this:
{'Colors.of.the universe':{Black:111,Grey:222,White:11},
'Movies of the week':{Mission Impossible:121,Die_Hard:123,Jurassic Park:33},
'Lands.categories.said': {Desert:33212,forest:4532,grassland:431,tundra:243451}}
I tried the code below but it was not good:
{words[1]:words[1:] for words in a}
which gives
{'o': 'olors.of.the universe:\n',
' ': ' tundra : 243451\n',
'a': 'ands.categories.said:\n'}
It only takes a single character as the key, which is not what's needed.
A dict comprehension is an interesting approach.
a = ['Colors.of.the universe:\n',
' Black: 111\n',
' Grey: 222\n',
' White: 11\n',
'Movies of the week:\n',
' Mission Impossible: 121\n',
' Die_Hard: 123\n',
' Jurassic Park: 33\n',
'Lands.categories.said:\n',
' Desert: 33212\n',
' forest: 4532\n',
' grassland : 431\n',
' tundra : 243451\n']
result = dict()
current_key = None
for w in a:
    # If the line starts with spaces, it's an item (under the current category)
    if w.startswith(' '):
        # Split the item, e.g. ' Desert: 33212\n' -> [' Desert', ' 33212\n']
        splitted = w.split(':')
        # Set the key and the value of the item:
        # strip redundant spaces and '\n', and convert the value to a number
        k, v = splitted[0].strip(), int(splitted[1].replace('\n', ''))
        result[current_key][k] = v
    # Else, it's a category line
    else:
        # Remove ':' and '\n' from the category name
        current_key = w.replace(':', '').replace('\n', '')
        # If the category doesn't exist yet, create a dictionary for it
        if current_key not in result:
            result[current_key] = {}
# {'Colors.of.the universe': {'Black': 111, 'Grey': 222, 'White': 11}, 'Movies of the week': {'Mission Impossible': 121, 'Die_Hard': 123, 'Jurassic Park': 33}, 'Lands.categories.said': {'Desert': 33212, 'forest': 4532, 'grassland': 431, 'tundra': 243451}}
print(result)
That's really close to valid YAML already. You could just quote the property labels and parse. And parsing a known format is MUCH superior to dealing with and/or inventing your own. Even if you're just exploring base python, exploring good practices is just as (probably more) important.
import re
import yaml
raw = ['Colors.of.the universe:\n',
' Black: 111\n',
' Grey: 222\n',
' White: 11\n',
'Movies of the week:\n',
' Mission Impossible: 121\n',
' Die_Hard: 123\n',
' Jurassic Park: 33\n',
'Lands.categories.said:\n',
' Desert: 33212\n',
' forest: 4532\n',
' grassland : 431\n',
' tundra : 243451\n']
# Fix spaces in property names
fixed = []
for line in raw:
    match = re.match(r'^( *)(\S.*?): ?(\S*)\s*', line)
    if match:
        fixed.append('{indent}{safe_label}:{value}'.format(
            indent = match.group(1),
            safe_label = "'{}'".format(match.group(2)),
            value = ' ' + match.group(3) if match.group(3) else ''
        ))
    else:
        raise Exception("regex failed")
parsed = yaml.load('\n'.join(fixed), Loader=yaml.FullLoader)
print(parsed)
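A small note on the yaml.load call: since the fixed-up input contains only plain mappings and scalars, yaml.safe_load('\n'.join(fixed)) should work just as well here and avoids passing a Loader explicitly.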

Using regex to parse kindle "My Clippings.txt" file

I am currently trying to use Python to parse my Kindle's notes file so that I can keep my notes more organized than the chronologically ordered list the Kindle automatically saves them in. Unfortunately, I'm having trouble using regex to parse the file. Here's my code so far:
import re
def parse_file(in_file):
    read_file = open(in_file, 'r')
    file_lines = read_file.readlines()
    read_file.close()
    raw_note = "".join(file_lines)
    # Regex parts
    title_regex = "(.+)"
    title_author_regex = "(.+) \((.+)\)"
    loc_norange_regex = "(.+) (Location|on Page) ([0-9]+)"
    loc_range_regex = "(.+) (Location|on Page) ([0-9]+)-([0-9]+)"
    date_regex = "([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)" # Date
    time_regex = "([0-9]+):([0-9]+) (AM|PM)" # Time
    content_regex = "(.*)"
    footer_regex = "=+"
    nl_re = "\r*\n"
    # No author
    regex_noauthor_str =\
        title_regex + nl_re +\
        "- Your " + loc_range_regex + " | Added on " +\
        date_regex + ", " + time_regex + nl_re +\
        content_regex + nl_re +\
        footer_regex
    regex_noauthor = re.compile(regex_noauthor_str)
    print regex_noauthor.findall(raw_note)
parse_file("testnotes")
Here is the contents of "testnotes":
Title
- Your Highlight Location 3360-3362 | Added on Wednesday, March 21, 2012, 12:16 AM
Note content goes here
==========
What I want:
[('Title', 'Highlight', 'Location', '3360', '3362', 'Wednesday', 'March', '21', '2012', '12', '16', 'AM',
But when I run the program, I get:
[('Title', 'Highlight', 'Location', '3360', '3362', '', '', '', '', '', '', '', '')]
I'm fairly new to regex, but I feel like this should be fairly straightforward.
When you say " | Added on ", you need to escape the |.
Replace that string with " \| Added on "
You need to escape the | in "- Your " + loc_range_regex + " | Added on " +\
to: "- Your " + loc_range_regex + " \| Added on " +\
| is the OR operator in a regex.
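To illustrate, here is a reduced sketch (not the full pattern from the question; the single line below is just an assumed example for demonstration):
import re
line = "- Your Highlight Location 3360-3362 | Added on Wednesday, March 21, 2012, 12:16 AM"
# Unescaped, "|" is alternation: the pattern splits into two alternatives,
# so no single match captures both the location range and the date.
broken = r"- Your (.+) (Location|on Page) ([0-9]+)-([0-9]+) | Added on ([a-zA-Z]+)"
print(re.search(broken, line).groups())  # ('Highlight', 'Location', '3360', '3362', None)
# Escaped, the pipe is matched literally and all groups are filled.
fixed = r"- Your (.+) (Location|on Page) ([0-9]+)-([0-9]+) \| Added on ([a-zA-Z]+)"
print(re.search(fixed, line).groups())   # ('Highlight', 'Location', '3360', '3362', 'Wednesday')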
Should anyone need an update to this, the following works with Paperwhite & Voyage Kindles in 2017 : https://gist.github.com/laffan/7b945d256028d2ffaacd4d99be40ca34

How to rename dictionary keys in Python

How can I rename dictionary keys in Python?
I have this code:
t = { u'last_name': [u'hbkjh'], u'no_of_nights': [u'1'], u'check_in': [u'2012-03-19'], u'no_of_adult': [u'', u'1'], u'csrfmiddlewaretoken': [u'05e5bdb542c3be7515b87e8160c347a0'], u'memo': [u'kjhbn'], u'totalcost': [u'1800.0'], u'product': [u'4'], u'exp_month': [u'1'], u'quantity': [u'2'], u'price': [u'900.0'], u'first_name': [u'sdhjb'], u'no_of_kid': [u'', u'0'], u'exp_year': [u'2012'], u'check_out': [u'2012-03-20'], u'email': [u'ebmalifer#agile.com.ph'], u'contact': [u'3546576'], u'extra_test1': [u'jknj'], u'extra_test2': [u'jnjl'], u'security_code': [u'3245'], u'extra_charged': [u'200.0']}
key = {str(k): str(v[0]) for k,v in t.iteritems() if k.startswith('extra_')}
array = []
for val in key:
    data = str(val) + ' = ' + key[val] + ','
    array.append(data)
print array
It gives me this:
["extra_charged = 200.0,", "extra_test1 = jknj,", "extra_test2 = jnjl,"]
What should I do to remove the 'extra_' prefix so the output looks like this:
["CHARGED = 200.0,", "TEST1 = jknj,", "TEST2 = jnjl,"]
Does anyone have an idea for my case? Thanks in advance.
So, string slicing can strip off the first 6 characters, and upper() should uppercase the rest.
Replace that one data= line with:
data = str(val)[6:].upper() + ' = ' + key[val] + ','
That should work.
I found .replace() and did it like this:
data = str(val).replace("extra_","").upper() + ' = ' + key[val] + ','
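For reference, a compact Python 3 sketch of the same idea (it assumes the same dict t from the question; in Python 3, iteritems() becomes items() and print needs parentheses):
renamed = {k[len('extra_'):].upper(): v[0] for k, v in t.items() if k.startswith('extra_')}
array = [k + ' = ' + v + ',' for k, v in renamed.items()]
print(array)  # e.g. ['TEST1 = jknj,', 'TEST2 = jnjl,', 'CHARGED = 200.0,']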
