Python merging two CSV files

I have two CSV files. One:
s555555,7
s333333,10
s666666,9
s111111,10
s999999,9
And two:
s111111,,,,,
s222222,,,,,
s333333,,,,,
s444444,,,,,
s555555,,,,,
s666666,,,,,
s777777,,,,,
I want to end up with:
[['s111111', '10', '', '', '', ''],
['s222222', '', '', '', '', ''],
['s333333', '10', '', '', '', ''],
['s444444', '', '', '', '', ''],
['s555555', '7', '', '', '', ''],
['s666666', '9', '', '', '', ''],
['s777777', '', '', '', '', '']]
Here's my code:
new_marks = get_marks_from_file('assign1_marks.csv')
marks = get_marks_from_file('marks.csv')
def merge_marks(all_marks, new_marks, column):
    for n in range(len(new_marks)):
        for a in range(len(all_marks)):
            if all_marks[a][0]==new_marks[n][0]:
                all_marks[a][column]= new_marks[n][column]
                return marks
What am I doing wrong? I keep getting:
>>> merge_marks(marks, new_marks, 1)
[['s111111', '', '', '', '', ''],
['s222222', '', '', '', '', ''],
['s333333', '', '', '', '', ''],
['s444444', '', '', '', '', ''],
['s555555', '7', '', '', '', ''],
['s666666', '', '', '', '', ''],
['s777777', '', '', '', '', '']]

The line
return marks
has to be unindented by three levels, to get it out of both for loops and the if statement. Right now the function returns at the first all_marks[a][0]==new_marks[n][0] match it finds and never updates the other rows.
You also want to return all_marks rather than marks. In this case the global variable marks happens to be the same list and is also changed, but the function would return the wrong list if you called it with any other list.
The solution is thus:
def merge_marks(all_marks, new_marks, column):
    for n in range(len(new_marks)):
        for a in range(len(all_marks)):
            if all_marks[a][0]==new_marks[n][0]:
                all_marks[a][column]= new_marks[n][column]
    return all_marks
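As a possible variation on the fix above (not part of the original answer), the nested loops can be replaced with a dictionary lookup keyed on the student ID. This is only a sketch: the name lookup and the small in-memory sample rows are assumptions made for illustration, standing in for whatever get_marks_from_file returns.

```python
def merge_marks(all_marks, new_marks, column):
    """Copy `column` from new_marks into the all_marks rows with the same ID."""
    # Hypothetical helper structure: student ID -> new value for that column.
    lookup = {row[0]: row[column] for row in new_marks}
    for row in all_marks:
        if row[0] in lookup:
            row[column] = lookup[row[0]]
    return all_marks


# Small in-memory sample shaped like the two files in the question.
new_marks = [['s555555', '7'], ['s333333', '10'], ['s666666', '9']]
marks = [['s333333', '', ''], ['s444444', '', ''], ['s555555', '', '']]
merged = merge_marks(marks, new_marks, 1)
print(merged)
```

Rows without a matching ID are left untouched, and the function returns the list it was given rather than a global.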

Related

Iterating over list with while sentence and values not being excluded with list.remove

I'm running some code to clean a database. Basically, if a value appears in one of the other lists, it should be removed.
Below you can see the code:
pattern = re.compile("((?:\d{10}|\d{9}|\d{8}|\d{7}|\d{6}|\d{5}|\d{4})(?:-?[\d]))?(?!\S)")
cc = pattern.findall(a)
print("cpf:", cpf)
print("ag:", ag)
print("cc start:",cc)
for i in cc:
    print("i:",i)
    try:
        while i in ag: cc.remove(i)
    except:pass
    try:
        while i in cpf:cc.remove(i)
    except:pass
    try:
        while "" in i:cc.remove(i)
    except:pass
print("final cc:",cc)
It prints in my screen the following:
cpf: ['00770991092']
ag: 3527
cc start: ['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '00770991092', '', '', '', '', '', '', '', '', '01068651-0', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
i:
i: 01068651-0
final cc: ['00770991092']
Well, the '' values are removed, so that seems to be working fine. However, since '00770991092' is a value inside cpf, it should have been removed, but it wasn't. That's the value I'm getting in "final cc", and it should be '01068651-0'.
Even if I run this check:
if cc in cpf:print(True)
It confirms it is True.
What am I missing?
PS: I find it quite intriguing that when I print(i) inside the for loop, only two values show up (and one is empty).

Modifying a list as you're iterating over it doesn't work very well. Is building a new list an option? Something like:
filtered_cc = [
    i for i in cc
    if not (i in ag or i in cpf or i == "")
]
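To make the skipping concrete, here is a small sketch of what happens when items are removed during iteration, followed by the filtering approach. It assumes ag and cpf are lists of strings (the question does not show how ag is defined), and the names items, visited and excluded are made up for this example. Note also that "" in i is true for every string i, which is why the third loop in the question removes whatever item is currently being visited.

```python
# Removing from a list while looping over it makes the loop skip elements.
items = ['a', 'b', 'c', 'd']
visited = []
for x in items:
    visited.append(x)
    items.remove(x)
print(visited)  # ['a', 'c']
print(items)    # ['b', 'd']

# Building a new list avoids the problem entirely.
cc = ['', '', '00770991092', '', '01068651-0', '']
cpf = ['00770991092']
ag = ['3527']  # assumption: a list of strings

excluded = set(cpf) | set(ag) | {''}
filtered_cc = [i for i in cc if i not in excluded]
print(filtered_cc)  # ['01068651-0']
```

Each removal shifts the remaining items one position to the left while the loop's internal index still moves forward, so every other element is never visited.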

Python: Regular expressions - extract Chinese text

I'm trying to pull out the province and city names from the following text (this is html, but I removed some of the escape characters). However, the regular expression I wrote returns a blank list.
When I tested the code on a re website (for example, https://regex101.com/), it seems to work, but it doesn't work when I write it in the script.
Here is a shortened version of my code (the html dump is much longer).
Any help would be appreciated.
import re
text = 'try window.getAreaStat = [provinceName:湖北省,provinceShortName:湖北,confirmedCount:3554,suspectedCount:0,curedCount:80,deadCount:125,comment:待明确地区：治愈 30,cities:[cityName:武汉,confirmedCount:1905,suspectedCount:0,curedCount:47,deadCount:104,cityName:黄冈,confirmedCount:324,suspectedCount:0,curedCount:2,deadCount:5,cityName:孝感,confirmedCount:274,suspectedCount:0,curedCount:0,deadCount:3,cityName:荆门,confirmedCount:142,suspectedCount:0,curedCount:0,deadCount:4,cityName:襄阳,confirmedCount:131,suspectedCount:0,curedCount:0,deadCount:0,cityName:随州,confirmedCount:116,suspectedCount:0,curedCount:0,deadCount:0,cityName:咸宁,confirmedCount:112,suspectedCount:0,curedCount:0,deadCount:0,cityName:荆州,confirmedCount:101,suspectedCount:0,curedCount:1,deadCount:2,cityName:十堰,confirmedCount:88,suspectedCount:0,curedCount:0,deadCount:0,cityName:黄石,confirmedCount:86,suspectedCount:0,curedCount:0,deadCount:1,cityName:鄂州,confirmedCount:84,suspectedCount:0,curedCount:0,deadCount:1,cityName:宜昌,confirmedCount:63,suspectedCount:0,curedCount:0,deadCount:1,cityName:恩施州,confirmedCount:51,suspectedCount:0,curedCount:0,deadCount:0,cityName:天门,confirmedCount:34,suspectedCount:0,curedCount:0,deadCount:3,cityName:仙桃,confirmedCount:32,suspectedCount:0,curedCount:0,deadCount:0,cityName:潜江,confirmedCount:8,suspectedCount:0,curedCount:0,deadCount:1,cityName:神农架林区,confirmedCount:3,suspectedCount:0,curedCount:0,deadCount:0],provinceName:浙江省,provinceShortName:浙江,confirmedCount:296,suspectedCount:0,curedCount:3,deadCount:0,comment:,cities:[cityName:温州,confirmedCount:114,suspectedCount:0,curedCount:3,deadCount:0,cityName:杭州,confirmedCount:51,suspectedCount:0,curedCount:0,deadCount:0,cityName:台州,confirmedCount:40,suspectedCount:0,curedCount:0,deadCount:0,cityName:宁波,confirmedCount:20,suspectedCount:0,curedCount:0,deadCount:0,cityName:绍兴,confirmedCount:19,suspectedCount:0,curedCount:0,deadCount:0,cityName:嘉兴,confirmedCount:14,suspectedCount:0,curedCount:0,deadCount:0,cityName:金华,confirmedCount:13,suspectedCount:0,
curedCount:0,deadCount:0,cityName:衢州,confirmedCount:8,suspectedCount:0,curedCount:0,deadCount:0,cityName:舟山,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:丽水,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:湖州,confirmedCount:5,suspectedCount:0,curedCount:0,deadCount:0],provinceName:广东省,provinceShortName:广东,confirmedCount:241,suspectedCount:0,curedCount:5,deadCount:0,comment:,cities:[cityName:广州,confirmedCount:63,suspectedCount:0,curedCount:0,deadCount:0,cityName:深圳,confirmedCount:63,suspectedCount:0,curedCount:4,deadCount:0,cityName:佛山,confirmedCount:18,suspectedCount:0,curedCount:0,deadCount:0,cityName:珠海,confirmedCount:14,suspectedCount:0,curedCount:0,deadCount:0,cityName:惠州,confirmedCount:12,suspectedCount:0,curedCount:1,deadCount:0,cityName:中山,confirmedCount:12,suspectedCount:0,curedCount:0,deadCount:0,cityName:阳江,confirmedCount:10,suspectedCount:0,curedCount:0,deadCount:0,cityName:湛江,confirmedCount:7,suspectedCount:0,curedCount:0,deadCount:0,cityName:东莞,confirmedCount:7,suspectedCount:0,curedCount:0,deadCount:0,cityName:清远,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:汕头,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:揭阳,confirmedCount:6,suspectedCount:0,curedCount:0,deadCount:0,cityName:肇庆,confirmedCount:5,suspectedCount:0,curedCount:0,deadCount:0,cityName:韶关,confirmedCount:4,suspectedCount:0,curedCount:0,deadCount:0,cityName:梅州,confirmedCount:4,suspectedCount:0,curedCount:0,deadCount:0,cityName:茂名,confirmedCount:2,suspectedCount:0,curedCount:0,deadCount:0,cityName:汕尾,confirmedCount:1,suspectedCount:0,curedCount:0,deadCount:0,cityName:河源'
regex = "((?<=provinceName:)|(?<=cityName:)).*?(?=,)"
province = re.findall(regex, text)
print(province)
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

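As explained in this answer, when the pattern contains a capturing group, re.findall returns the captured groups rather than the whole match. I tried your regex on https://regexr101.com and every captured group came back blank, because the group only contains zero-width lookbehinds.
You can use a non-capturing group instead, written (?:...):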
regex = "(?:(?<=provinceName:)|(?<=cityName:)).*?(?=,)"
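The difference between the two patterns can be checked on a shortened input. This is a sketch only: the sample string is a cut-down excerpt of the text in the question, and the variable names capturing and non_capturing are chosen for illustration.

```python
import re

# Shortened excerpt of the text from the question.
text = ("provinceName:湖北省,provinceShortName:湖北,"
        "cityName:武汉,confirmedCount:1905,"
        "cityName:黄冈,confirmedCount:324")

capturing = r"((?<=provinceName:)|(?<=cityName:)).*?(?=,)"
non_capturing = r"(?:(?<=provinceName:)|(?<=cityName:)).*?(?=,)"

# With a capturing group, findall returns the (empty) group contents.
print(re.findall(capturing, text))      # ['', '', '']

# With a non-capturing group, findall returns the whole match.
print(re.findall(non_capturing, text))  # ['湖北省', '武汉', '黄冈']
```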

Getting too many matches for one string segment in regex (python)

I'm trying to write a regex script for finding all instances of money in a text, and my code works correctly but I can't figure out why it's finding multiple versions of things in my strings.
For example, in this code:
string = "$50.00"
print "number dollars: "
print re.findall("\-?\(?\$?\s*\-?\s*\(?(((\d{1,3}((\,\d{3})*|\d*))?(\.\d{1,4})?)|((\d{1,3}((\,\d{3})*|\d*))(\.\d{0,4})?))\)?\ ?(one)?\ ?(two)?\ ?(three)?\ ?(four)?\ ?(five)?\ ?(six)?\ ?(seven)?\ ?(eight)?\ ?(nine)?\ ?(ten)?\ ?(eleven)?\ ?(twelve)?\ ?(thirteen)?\ ?(fourteen)?\ ?(fifteen)?\ ?(sixteen)?\ ?(seventeen)?\ ?(eighteen)?\ ?(nineteen)?\ ?(hundred)?\ ?(thousand)?\ ?(million)?\ ?(billion)?\ ?(trillion)?\ ?(dollars)?\ ?(pounds)?\ ?(euros)?", string)
This is the result I get:
number dollars:
[('50.00', '50.00', '50', '', '', '.00', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')]
this is the regex by itself:
\-?\(?\$?\s*\-?\s*\(?(((\d{1,3}((\,\d{3})*|\d*))?(\.\d{1,4})?)|((\d{1,3}((\,\d{3})*|\d*))(\.\d{0,4})?))\)?\ ?(one)?\ ?(two)?\ ?(three)?\ ?(four)?\ ?(five)?\ ?(six)?\ ?(seven)?\ ?(eight)?\ ?(nine)?\ ?(ten)?\ ?(eleven)?\ ?(twelve)?\ ?(thirteen)?\ ?(fourteen)?\ ?(fifteen)?\ ?(sixteen)?\ ?(seventeen)?\ ?(eighteen)?\ ?(nineteen)?\ ?(hundred)?\ ?(thousand)?\ ?(million)?\ ?(billion)?\ ?(trillion)?\ ?(dollars)?\ ?(pounds)?\ ?(euros)?

The results contain a string from each and every parenthesized group, corresponding to the portion of the string matched by the subexpression in each group, in order of opening parentheses (e.g. (\d+(\.\d+)?) would give [('50.00', '.00')]). To prevent the contents of a group from being captured, prefix the subexpression with ?: (e.g. (?:,\d{3})*); this creates a non-capturing group.
The majority of the groups are for words that don't appear in the string, which produces most of the empty strings in the result. The second, all-empty tuple appears because every part of the pattern is optional, so the pattern also matches the empty string at the end of the input.
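A cut-down illustration of this behaviour, using a much shorter pattern than the one in the question (the simplified patterns are assumptions chosen only to show how groups affect the result):

```python
import re

s = "$50.00"

# Two capturing groups: findall returns one tuple per match, one entry per group.
with_groups = re.findall(r"(\d+(\.\d+)?)", s)
print(with_groups)     # [('50.00', '.00')]

# One capturing group: findall returns only that group's text.
one_group = re.findall(r"\d+(\.\d+)?", s)
print(one_group)       # ['.00']

# No capturing groups: findall returns the whole match.
without_groups = re.findall(r"\d+(?:\.\d+)?", s)
print(without_groups)  # ['50.00']
```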

How to know index of a decimal value in a python list

I have a list like the following
['UIS', '', '', '', '', '', '', '', '', '02/05/2014', 'N', '', '', '', '', '9:30:00', '', '', '', '', '', '', '', '', '31.8000', '', '', '', '', '', '', '3591', 'O', '', '', '', '', '0', '', '', '', '', '', '', '', '', '', '', '', '', '', '0']
How can I find out which element is a decimal number? Basically, I want to locate the 31.8000 value in the list. Is that possible?

You can reliably check whether a string holds a floating-point number by evaluating it with literal_eval and checking whether the result is of type float, like this:
from ast import literal_eval
result = []
for item in data:
    temp = ""
    try:
        temp = literal_eval(item)
    except (SyntaxError, ValueError):
        pass
    if isinstance(temp, float):
        result.append(item)
print(result)
# ['31.8000']
If you want to get the indexes, just enumerate the data like this:
for idx, item in enumerate(data):
    ...
    ...
and, while preparing the result, add the index instead of the actual element:
        result.append(idx)
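Putting the two fragments together, a complete index-collecting version might look like the sketch below. The function name float_indexes and the shortened sample list are assumptions for illustration, not part of the original answer.

```python
from ast import literal_eval


def float_indexes(items):
    """Return the indexes of the strings that literal_eval parses as a float."""
    indexes = []
    for idx, item in enumerate(items):
        try:
            value = literal_eval(item)
        except (SyntaxError, ValueError):
            continue  # not a Python literal at all
        if isinstance(value, float):
            indexes.append(idx)
    return indexes


# Shortened sample in the style of the list from the question.
sample = ['UIS', '', '02/05/2014', 'N', '9:30:00', '31.8000', '3591', 'O', '0']
print(float_indexes(sample))  # [5]
```

Integer strings such as '3591' and '0' are skipped because literal_eval turns them into int, not float.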

Iterate over the list and check whether float() succeeds. Note that float() also accepts integer strings such as '3591' and '0', so their indexes are collected too:
floatables = []
for i,item in enumerate(data):
    try:
        float(item)
        floatables.append(i)
    except ValueError:
        pass
print(floatables)
Alternatively, if you want to match the decimal format, you can use:
import re
decimals = []
for i,item in enumerate(data):
    if re.match(r"^\d+?\.\d+?$", item) is not None:
        decimals.append(i)
print(decimals)

Using a list comprehension and a regular expression match (the sign is optional, hence [+-]?):
>>> import re
>>> [float(i) for i in x if re.match(r'^[+-]?\d+[.]\d+$',i)]
[31.8]
If you want to track the indexes of the floats:
>>> [idx for idx, i in enumerate(x) if re.match(r'^[+-]?\d+[.]\d+$',i)]
[24]

data = ['UIS', '', '', '', '', '', '', '', '', '02/05/2014', 'N', '', '', '', '', '9:30:00', '', '', '', '', '', '', '', '', '31.8000', '', '', '', '', '', '', '3591', 'O', '', '', '', '', '0', '', '', '', '', '', '', '', '', '', '', '', '', '', '0']
import decimal
target = decimal.Decimal('31.8000')
def is_target(value):
    # True only for strings that parse to a Decimal equal to the target
    try:
        return decimal.Decimal(value) == target
    except decimal.InvalidOperation:
        return False
output = list(filter(is_target, data))
print(output)

Python array deleting items

I have an array:
a=['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '151 ihi Chun', '151 ihi Chun', '149 st Hg', '149 st Hg', '125 Tatane', '125 Tatane', '174 Sunnygat', '174 Sunnygat', '174 Sunnygat', '126 Nank', '126 Nank', '162 Rass', '162 Rass']
I want to remove all the '' items, but I can't. Neither
a.remove('')
nor
while a.index(''): a.remove('')
helps.

Use a filter() call with None as the filter (it tests for truth, so non-emptiness). In Python 3, wrap it in list() because filter() returns an iterator:
a = list(filter(None, a))
or a list comprehension:
a = [e for e in a if e]
If you need to explicitly keep other 'false' values and only want to filter out empty strings, use:
a = [e for e in a if e != '']
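The difference between the truthiness test and the explicit comparison only shows up when the list holds other false values. A small sketch, using a made-up list that mixes in 0 and None for illustration:

```python
# Made-up sample: two entries from the question plus other "false" values.
a = ['', '151 ihi Chun', '', 0, None, '149 st Hg']

truthy = [e for e in a if e]           # drops '', 0 and None
filtered = list(filter(None, a))       # same result as the truthiness test
non_empty = [e for e in a if e != '']  # drops only the empty strings

print(truthy)     # ['151 ihi Chun', '149 st Hg']
print(filtered)   # ['151 ihi Chun', '149 st Hg']
print(non_empty)  # ['151 ihi Chun', 0, None, '149 st Hg']
```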

If those items are actually '', in other words, empty strings, then you can use the following:
a = [x for x in a if x]
This works because an empty string evaluates to false when used in a truth test.

Try:
for i in a:
a.remove('')
a.remove('')
I'm also not sure why the first time doesn't remove all of them, but the second time it does remove all the blanks.
