Need a regular expression to split String in Python [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
str = 'FW201703002082017MF0164EXESTBOPF01163500116000 0001201700258000580000116000.WALTERS BAY BOGAWANTALAWA 1M'
The string above needs to be split and its fields extracted separately, as follows:
Borkername = FW
Sale year = 2017
Saleno = 0300
sale_dte = 20.08.2017 # date need to be format
Factoryno = MF0164
Catalogu code= EXEST
Grade =BOPF
Gross weight =01163.50 #decimal point needed
Net Weight = 01163.50 #decimal point needed
Lot_No = 0001
invoice_year = 2017
invoice_no = 00258
price = 000580.00 #decimal point needed
Netweight = 01160.00 #decimal point needed
Buyer = 'WALTERS BAY BOGAWANTALAWA'
Buyer_code = '1M'
This is a single line without any delimiters. Please help me write a regular expression that separates each field into a column of a pandas DataFrame in Python.
For example:
(\A[A-Z]{2})
This will give me the first 2 characters. How can I get next 4 digits as the year?

You need to do this in two goes. First use a regular expression to split the string up into (mostly) fixed length segments. Then with the list you get back, fix the fields manually into the format you require. For example:
import re
import csv
headings = [
    "Borkername", "Sale year", "Saleno", "sale_dte", "Factoryno", "Catalogu code", "Grade", "Gross weight",
    "Net Weight", "Lot_No", "invoice_year", "invoice_no", "price", "Netweight", "Buyer", "Buyer_code"]
re_fields = re.compile(r'(.{2})(.{4})(.{3})(.{8})(.{6})(.{5})(.{4})(.{7})(.{7}) (.{4})(.{4})(.{5})(.{8})(.{7}).(.*?) (.{2})$')
with open('input.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_writer = csv.writer(f_output)
    csv_writer.writerow(headings)
    for line in f_input:
        fields = list(re_fields.match(line).groups())
        fields[3] = "{}.{}.{}".format(fields[3][:2], fields[3][2:4], fields[3][4:])
        fields[7] = float("{}.{}".format(fields[7][:5], fields[7][5:]))
        fields[8] = float("{}.{}".format(fields[8][:5], fields[8][5:]))
        fields[12] = float("{}.{}".format(fields[12][:6], fields[12][6:]))
        fields[13] = float("{}.{}".format(fields[13][:5], fields[13][5:]))
        csv_writer.writerow(fields)
This would give you output.csv containing:
Borkername,Sale year,Saleno,sale_dte,Factoryno,Catalogu code,Grade,Gross weight,Net Weight,Lot_No,invoice_year,invoice_no,price,Netweight,Buyer,Buyer_code
FW,2017,030,02.08.2017,MF0164,EXEST,BOPF,1163.5,1160.0,0001,2017,00258,580.0,1160.0,WALTERS BAY BOGAWANTALAWA,1M
This can then be read in using Pandas:
import pandas as pd
data = pd.read_csv('output.csv')
print(data)
Which gives:
  Borkername  Sale year  Saleno    sale_dte Factoryno Catalogu code Grade  Gross weight  Net Weight  Lot_No  \
0         FW       2017      30  02.08.2017    MF0164         EXEST  BOPF        1163.5      1160.0       1
   invoice_year  invoice_no  price  Netweight                      Buyer Buyer_code
0          2017         258  580.0     1160.0  WALTERS BAY BOGAWANTALAWA         1M
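As a variation on the approach above, the same split can be sketched with named groups, so each field's width sits next to its name. The group names and the helper `parse_record` are illustrative assumptions, not part of the original answer; the field widths are copied from the pattern above, and the date is assumed to be DDMMYYYY.

```python
import re

# Field widths copied from the pattern in the answer; group names are illustrative.
RECORD = re.compile(
    r'(?P<broker>.{2})(?P<sale_year>.{4})(?P<sale_no>.{3})(?P<sale_date>.{8})'
    r'(?P<factory_no>.{6})(?P<catalogue>.{5})(?P<grade>.{4})'
    r'(?P<gross_weight>.{7})(?P<net_weight>.{7}) '
    r'(?P<lot_no>.{4})(?P<invoice_year>.{4})(?P<invoice_no>.{5})'
    r'(?P<price>.{8})(?P<net_weight_2>.{7})\.(?P<buyer>.*?) (?P<buyer_code>.{2})$'
)

def parse_record(line):
    """Split one fixed-width record into a dict of cleaned-up fields."""
    match = RECORD.match(line)
    if match is None:
        raise ValueError("line does not match the expected layout")
    rec = match.groupdict()
    d = rec['sale_date']
    rec['sale_date'] = f"{d[:2]}.{d[2:4]}.{d[4:]}"   # DDMMYYYY -> DD.MM.YYYY (assumed)
    # The last two digits of each numeric field are treated as decimals.
    for name in ('gross_weight', 'net_weight', 'price', 'net_weight_2'):
        rec[name] = int(rec[name]) / 100
    return rec

sample = ('FW201703002082017MF0164EXESTBOPF01163500116000 '
          '0001201700258000580000116000.WALTERS BAY BOGAWANTALAWA 1M')
record = parse_record(sample)
print(record['sale_date'], record['gross_weight'], record['buyer'])
```

Fields such as `lot_no` and `invoice_no` stay as strings here, so their leading zeros are preserved.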

Related

How to tackle csv files in Python without Pandas

I've been given a homework task to get data from a csv file without using Pandas. The info in the csv file contains headers such as...
work year:
experience level: EN Entry-level / Junior; MI Mid-level / Intermediate; SE Senior-level / Expert; EX Executive-level / Director
employment type: PT Part-time; FT Full-time; CT Contract; FL Freelance
job title:
salary:
salary currency:
salary in usd: The salary in USD
employee residence: Employee’s primary country of residence
remote ratio:
One of the questions is:
For each experience level, compute the average salary (over 3 years (2020/21/22)) for each job title?
The only way I've managed to do this is to iterate through the csv and add a load of 'if' statements according to the experience level and job title, but this is taking me forever.
Any ideas of how to tackle this differently? Not using any libraries/modules.
Example of my code:
# running total and count, initialised so the += below works
en_ai_scientist = 0
count_en_ai_scientist = 0
with open('/Users/xxx/Desktop/ds_salaries.csv', 'r') as f:
    csv_reader = f.readlines()
    for row in csv_reader[1:]:
        new_row = row.split(',')
        experience_level = new_row[2]
        job_title = new_row[4]
        salary_in_usd = new_row[7]
        if experience_level == 'EN' and job_title == 'AI Scientist':
            en_ai_scientist += int(salary_in_usd)
            count_en_ai_scientist += 1
avg_en_ai_scientist = en_ai_scientist / count_en_ai_scientist
print(avg_en_ai_scientist)
When working out an example like this, I find it helpful to ask, "What data structure would make this question easy to answer?"
For example, the question asks
For each experience level, compute the average salary (over 3 years (2020/21/22)) for each job title?
To me, this implies that I want a dictionary keyed by a tuple of experience level and job title, with the salaries of every person who matches. Something like this:
data = {
("EN", "AI Scientist"): [1000, 2000, 3000],
("SE", "AI Scientist"): [2000, 3000, 4000],
}
The next question is: how do I get my data into that format? I would read the data in with csv.DictReader, and add each salary number into the structure.
import csv
data = {}
with open('input.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # column names are assumed to match the header row of the CSV
        experience_level = row['experience_level']
        job_title = row['job_title']
        key = experience_level, job_title
        if key not in data:
            # provide default value if no key exists
            # look at collections.defaultdict if you want to see a better way to do this
            data[key] = []
        # convert to int so the salaries can be summed and averaged later
        data[key].append(int(row['salary_in_usd']))
Now that you have your data organized, you can compute average salaries:
for (experience_level, job_title), salary_data in data.items():
    print(experience_level, job_title, sum(salary_data)/len(salary_data))
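The `collections.defaultdict` idea mentioned in the comment above could look roughly like this. It is only a sketch: the sample rows are made up, the helper name `average_salaries` is hypothetical, and the column names are assumed to match the real file's header.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; the column names are assumed to match the real file.
SAMPLE = """experience_level,job_title,salary_in_usd
EN,AI Scientist,1000
EN,AI Scientist,3000
SE,AI Scientist,4000
SE,Data Engineer,5000
"""

def average_salaries(csvfile):
    """Return {(experience_level, job_title): average salary in USD}."""
    salaries = defaultdict(list)   # missing keys start as an empty list
    for row in csv.DictReader(csvfile):
        key = (row['experience_level'], row['job_title'])
        salaries[key].append(int(row['salary_in_usd']))
    return {key: sum(values) / len(values) for key, values in salaries.items()}

averages = average_salaries(io.StringIO(SAMPLE))
for (level, title), avg in sorted(averages.items()):
    print(level, title, avg)
```

With a real file, you would pass the object returned by `open('input.csv', newline='')` instead of the `io.StringIO` wrapper.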

Using Python, how can I turn a text file list into a 5 column data set where some groups of data have less than 5 lines? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
Most days, I get a list of jobs that look a little like this:
1 Group type 1 # filename = "280321_AnonymisedTestData.txt"
2 Job Title A
3 Employer
4 Location
5 Z hours per week
6 Job Title B
7 Employer
8 Location
9 Y hours per week
10 Group type 1
11 Job Title C
12 Employer
13 Location
14 Y hours per week
15 Group type 1
16 Job Title D
17 Employer
18 Location
19 X hours per week
As you can see, most groupings or sets of text have 5 lines or elements, but some (lines 6 - 9) have only four elements in their set.
I'd like to put this data into a table, and to do this I have decided to put the word "General" on a line of its own in front of the sets with four elements in them.
So far my 'code' looks like this:
with open('280321_AnonymisedTestData.txt') as mylines:
    x = 6
    while (x < 7):
        for l in mylines:
            if not (l.startswith("Group type 1") or l.startswith("Group type 2") or l.startswith("Group type 3)") l = "General" + "\n" + l
and the Output shows a syntax error at line 5 at the point "l = "General" + ..."
I've got to work out the flow of the program so that the next 3 lines don't get "General" inserted in between them and I intend to put the sets in a data-set, but for now, can I get some advice on where I'm going wrong here?
Thanks for reading.
I despair that people can't reason through tasks like this. Just do it logically, one step at a time. What data do I HAVE, what data do I NEED, how do I get there?
dataset = []
with open('280321_AnonymisedTestData.txt') as mylines:
    accum = []
    for line in mylines:
        if not accum and not line.startswith("Group"):
            accum = ["General"]
        accum.append( line.strip() )
        if len(accum) == 5:
            dataset.append(accum)
            accum = []
from pprint import pprint
pprint(dataset)
Output:
[['Group type 1', 'Job Title A', 'Employer', 'Location', 'Z hours per week'],
['General', 'Job Title B', 'Employer', 'Location', 'Y hours per week'],
['Group type 1', 'Job Title C', 'Employer', 'Location', 'Y hours per week'],
['Group type 1', 'Job Title D', 'Employer', 'Location', 'X hours per week']]
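Since the goal was a table, here is a rough sketch that wraps the same accumulation logic in a function and writes the rows out as CSV. The function name `group_jobs` and the column names are hypothetical, and the sample lines stand in for the real file.

```python
import csv
import io

def group_jobs(lines, size=5, default_group="General"):
    """Collect lines into rows of `size`, prepending `default_group`
    to any set that does not begin with a "Group" line."""
    rows, accum = [], []
    for line in lines:
        if not accum and not line.startswith("Group"):
            accum = [default_group]
        accum.append(line.strip())
        if len(accum) == size:
            rows.append(accum)
            accum = []
    return rows

sample = ["Group type 1\n", "Job Title A\n", "Employer\n", "Location\n",
          "Z hours per week\n", "Job Title B\n", "Employer\n", "Location\n",
          "Y hours per week\n"]
rows = group_jobs(sample)

# Column names below are hypothetical; rename them to suit your data.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["group", "title", "employer", "location", "hours"])
writer.writerows(rows)
print(out.getvalue())
```

To write a real file, swap the `io.StringIO()` buffer for `open('jobs.csv', 'w', newline='')`.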

How to separate a CSV column by position in Python [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
I have data that I want to separate into 3 columns from the single column in a CSV file.
The original file looks like this:
0400000006340000000000965871
0700000007850000000000336487
0100000003360000000000444444
I would like to separate the columns to resemble the list below, while still preserving the leading zeros:
04 0000000634 0000000000965871
07 0000000785 0000000000336487
01 0000000336 0000000000444444
I can load the file in Python, but I don't know which delimiter or positioning I have to use. The code I have so far:
import pandas as pd
df = pd.read_csv('new_numbers.txt', header=None)
Thank you for the help.
Use the pandas read_fwf() method - which stands for "fixed-width format":
pd.read_fwf('new_numbers.txt', widths=[2, 10, 16], header=None)
which will drop the leading zeroes:
   0    1       2
0  4  634  965871
1  7  785  336487
2  1  336  444444
To keep them, specify the dtype as strings with object:
pd.read_fwf('new_numbers.txt', widths=[2, 10, 16], dtype=object, header=None)
Output:
    0           1                 2
0  04  0000000634  0000000000965871
1  07  0000000785  0000000000336487
2  01  0000000336  0000000000444444
It looks like there is no delimiter and the fields have fixed lengths.
You can extract fixed-width fields by position using slice notation.
For instance:
str1 = "0400000006340000000000965871"
str1A = str1[:2]     # first 2 characters
str1B = str1[2:12]   # next 10 characters
str1C = str1[12:]    # remaining 16 characters
I wouldn't particularly bother with pandas for it unless you need a dataframe out the far end.
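If you go the plain-slicing route, the positions can be derived from a list of widths rather than hard-coded. A minimal sketch, assuming the widths 2, 10 and 16 inferred from the example rows; the helper name `split_fixed_width` is made up for illustration.

```python
def split_fixed_width(line, widths):
    """Cut `line` into consecutive fields of the given widths."""
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width])
        start += width
    return fields

WIDTHS = [2, 10, 16]   # widths inferred from the example rows
rows = ["0400000006340000000000965871",
        "0700000007850000000000336487",
        "0100000003360000000000444444"]
for row in rows:
    print(" ".join(split_fixed_width(row, WIDTHS)))
```

Because everything stays a string, the leading zeros are kept without any extra work.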
You don't need pandas to load your text file and read its content (and also, you aren't loading a csv file).
with open("new_numbers.txt") as f:
    lines = f.readlines()
I suggest using the re module.
import re
PATTERN = re.compile(r"(0*[1-9]+)(0*[1-9]+)(0*[1-9]+)")
This expression splits each line of your example into three groups. Note that it relies on every field being a run of zeros followed by non-zero digits, as in your example; a field with a zero after its first non-zero digit would be split in the wrong place.
Then you need to get matches from your lines, and join them with a space.
matches = []
for line in lines:
    match = PATTERN.match(line)
    first, second, third = match.group(1, 2, 3)
    matches.append(" ".join([first, second, third]))
At the end, matches will be a list of space-separated numbers (with leading zeros).
At this point you can write them to another file, or do whatever you need to do with them.
towrite = "\n".join(matches)
with open("output.txt", "w") as f:
    f.write(towrite)
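If the fields really are fixed width, a pattern that counts digits instead of looking for zeros may be more robust. This is a sketch of a different technique from the pattern above, assuming widths of 2, 10 and 16 digits as inferred from the example rows; `split_line` is a hypothetical helper.

```python
import re

# Widths (2, 10, 16) are inferred from the example rows.
FIXED = re.compile(r"(\d{2})(\d{10})(\d{16})")

def split_line(line):
    """Return the three fields of a 28-digit line, or None if it does not fit."""
    match = FIXED.fullmatch(line.strip())
    return match.groups() if match else None

print(split_line("0400000006340000000000965871\n"))
print(split_line("0400000006050000000000960871"))   # embedded zeros are fine
print(split_line("not a record"))
```

Returning `None` for lines that do not fit lets the caller decide whether to skip or report them.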

What is the easiest way to convert an API output to a data frame? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
AcledData = pd.read_csv("https://api.acleddata.com/acled/read?terms=accept&country=Afghanistan&date=20200315.csv", sep=',',quotechar='"', encoding ='utf-8')
print(AcledData)
Empty DataFrame
Columns: [{"status":200, success:true, last_update:117, count:500, data:[{"data_id":"6996791", iso:"4", event_id_cnty:"AFG44631", event_id_no_cnty:"44631", event_date:"2020-03-21", year:"2020", time_precision:"1", event_type:"Battles", sub_event_type:"Armed clash", actor1:"Taliban", assoc_actor_1:"", inter1:"2", actor2:"Military Forces of Afghanistan (2014-)", assoc_actor_2:"", inter2:"1", interaction:"12", region:"Caucasus and Central Asia", country:"Afghanistan", admin1:"Balkh", admin2:"Dawlat Abad", admin3:"", location:"Dawlat Abad", latitude:"36.9882", longitude:"66.8207", geo_precision:"2", source:"Xinhua; Khaama Press", source_scale:"National-International", notes:"On 21 March 2020, 12 Taliban militants including 2 commanders were killed and 5 including a commander were wounded when Afghan forces repulsed their attack in Dawlat Abad district, Balkh.", fatalities:"12", timestamp:"1584984341", iso3:"AFG"}, {"data_id":"6997066", iso:"4".1, event_id_cnty:"AFG44667", event_id_no_cnty:"44667", event_date:"2020-03-21".1, year:"2020".1, time_precision:"1".1, event_type:"Violence against civilians", sub_event_type:"Attack", actor1:"Unidentified Armed Group (Afghanistan)", assoc_actor_1:"".1, inter1:"3", actor2:"Civilians (Afghanistan)", assoc_actor_2:"Muslim Group (Afghanistan); Teachers (Afghanistan)", inter2:"7", interaction:"37", region:"Caucasus and Central Asia".1, country:"Afghanistan".1, admin1:"Kabul", admin2:"Kabul", admin3:"".1, location:"Kabul", latitude:"34.5167", longitude:"69.1833", geo_precision:"1", source:"Pajhwok Afghan News", source_scale:"National", notes:"On 21 March 2020.1, 1 religious scholar and teacher was killed by an unknown gunmen in Kabul city.", fatalities:"1", timestamp:"1584984341".1, iso3:"AFG"}.1, {"data_id":"6997171", iso:"4".2, event_id_cnty:"AFG44715", event_id_no_cnty:"44715", event_date:"2020-03-21".2, year:"2020".2, time_precision:"2", event_type:"Battles".1, sub_event_type:"Armed clash".1, actor1:"Taliban".1, 
assoc_actor_1:"".2, inter1:"2".1, actor2:"Military Forces of Afghanistan (2014-)".1, assoc_actor_2:"".1, inter2:"1".1, interaction:"12".1, region:"Caucasus and Central Asia".2, country:"Afghanistan".2, admin1:"Balkh".1, admin2:"Nahri Shahi", admin3:"".2, location:"Nahri Shahi", latitude:"36.8544", longitude:"67.1800", geo_precision:"2".1, source:"Voice of Jihad", source_scale:"Other", notes:"As reported on 21 March 2020, 3 Afghan security personnel were killed and 5 were wounded following an attack by Taliban militants on a check point in Nahri Shahi district, Balkh. Fatalities coded as 0 (VoJ reported 3 fatalities).", fatalities:"0", ...]
Index: []
The query returns a JSON string, but the relevant data is nested inside that JSON. You will have to:
read the returned string as a JSON object
use the relevant part of that object to feed a dataframe
For example using urllib.request, you could do:
import json
import urllib.request
import pandas as pd
data = json.load(urllib.request.urlopen('https://api.acleddata.com/acled/read?terms=accept&country=Afghanistan&date=20200315.csv'))['data']
df = pd.DataFrame(data)
If you want to convert that to a CSV file, there is no need for pandas; use the csv module instead:
import csv
import json
import urllib.request
data = json.load(urllib.request.urlopen('https://api.acleddata.com/acled/read?terms=accept&country=Afghanistan&date=20200315.csv'))['data']
with open('file.csv', 'w', newline='') as fd:
    wr = csv.DictWriter(fd, fieldnames=data[0].keys())
    _ = wr.writeheader()
    for d in data:
        _ = wr.writerow(d)
This is not csv, this is JSON
import pandas as pd
api = "https://api.acleddata.com/acled/read?terms=accept&country=Afghanistan&date=20200315.csv"
AcledData = pd.read_json(api)
The data field is itself nested JSON, but you can use a similar technique or DataFrame methods to get what you want.
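To see the nesting without calling the API, here is a small offline sketch. The payload is a heavily trimmed stand-in built from a few values visible in the question's output, and `extract_records` is a hypothetical helper; the real response has many more keys per record.

```python
import json

# Trimmed stand-in for the response; only the overall shape matters here.
SAMPLE_RESPONSE = '''
{"status": 200, "success": true, "count": 2,
 "data": [
   {"data_id": "6996791", "event_type": "Battles", "fatalities": "12"},
   {"data_id": "6997066", "event_type": "Violence against civilians", "fatalities": "1"}
 ]}
'''

def extract_records(text):
    """Parse the response body and return the list of event records."""
    payload = json.loads(text)
    if not payload.get("success"):
        raise ValueError("API reported failure: status {}".format(payload.get("status")))
    return payload["data"]

records = extract_records(SAMPLE_RESPONSE)
# Values arrive as strings, so convert the numeric ones explicitly.
total_fatalities = sum(int(r["fatalities"]) for r in records)
print(len(records), total_fatalities)
```

The returned list of dicts is the part you would hand to `pd.DataFrame` to get one row per event.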

Comparing four parameter in file [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 7 years ago.
Most profitable element for every category
I must read my file and determine the most profitable element for every category within a range of dates entered by the user.
File:
Date|Category|Name|Price
05/01/2016|category6|Name8|4200
06/01/2016|category1|Name1|1000
07/01/2016|category2|Name2|1200
07/01/2016|category3|Name1|1000
07/01/2016|category1|Name2|1200
07/01/2016|category3|Name2|1200
07/01/2016|category2|Name1|1000
07/01/2016|category2|Name2|1200
07/01/2016|category2|Name2|1200
08/01/2016|category2|Name1|1000
09/01/2016|category4|Name7|3100
My real file will be a lot bigger; this is just an example.
Start Date : 07/01/2016
End Date: 07/01/2016
For every date in that range, the program should print the most profitable element for every category:
Category 1:
07/01/2016|category1|Name2|1200
Name2 = 1200
Comparing prices >>> Most profitable is: Name2
Category 2:
07/01/2016|category2|Name2|1200
07/01/2016|category2|Name1|1000
07/01/2016|category2|Name2|1200
07/01/2016|category2|Name2|1200
Name1 = 1000
Name2 = 3600
Comparing prices >>> Most profitable: Name2
Category 3:
07/01/2016|category3|Name1|1000
07/01/2016|category3|Name2|1200
Name1: 1000
Name2: 1200
Comparing prices >>> Most profitable: Name2
The problem is that I don't know how to compare these prices across categories and names.
Also, the dates will always be in ascending order.
I'm using both dictionaries and lists.
INPUT AND OUTPUT:
Start Date : 07/01/2016
End Date: 07/01/2016
Category1; Most profitable is: Name2
Category2; Most profitable is: Name2
Category3; Most profitable is: Name2
In this case, the most profitable is Name2 for every category.
The following is not exactly what you need but should give you a fair idea to get going. I keep track of the most profitable name and value for combinations of date and category:
date_cat_profit_dict = {}
with open('data.txt') as f:
    next(f)  # skip the header row (Date|Category|Name|Price)
    for line in f:
        # split and store into variables.
        # You could skip processing line
        # if you are looking for specific date
        date, category, name, profit = line.split('|')
        # Convert to int for comparison
        profit = int(profit)
        # Key for storing into dict
        composite_key = '{0}|{1}'.format(date, category)
        # _ because we don't need the name right now
        _, max_profit = (date_cat_profit_dict.
                         setdefault(composite_key, ('', 0)))
        if max_profit < profit:
            date_cat_profit_dict[composite_key] = (name, profit)
for composite_key, (name, profit) in date_cat_profit_dict.items():
    print('Max for {0} : {1}, {2}'.format(composite_key, name, profit))
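The code above keeps the single highest price per date and category. The expected output in the question instead sums the prices per name (Name2 = 3600 for category2), so here is a rough sketch of that variant. It assumes the dates are day/month/year, which the sample data does not settle, and `most_profitable` is a made-up helper name.

```python
from collections import defaultdict
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"   # assumption: dates are day/month/year

def most_profitable(lines, start, end):
    """Return {category: (name, total)} for rows dated within [start, end]."""
    start_date = datetime.strptime(start, DATE_FORMAT)
    end_date = datetime.strptime(end, DATE_FORMAT)
    totals = defaultdict(lambda: defaultdict(int))   # category -> name -> summed price
    for line in lines:
        date, category, name, price = line.strip().split('|')
        if date == 'Date':          # skip the header row
            continue
        if start_date <= datetime.strptime(date, DATE_FORMAT) <= end_date:
            totals[category][name] += int(price)
    return {category: max(names.items(), key=lambda item: item[1])
            for category, names in totals.items()}

sample = """Date|Category|Name|Price
06/01/2016|category1|Name1|1000
07/01/2016|category1|Name2|1200
07/01/2016|category2|Name2|1200
07/01/2016|category2|Name1|1000
07/01/2016|category2|Name2|1200
08/01/2016|category2|Name1|1000""".splitlines()

result = most_profitable(sample, "07/01/2016", "07/01/2016")
for category, (name, total) in sorted(result.items()):
    print(category, name, total)
```

On a tie, `max` keeps the first name it encounters, so add an explicit tie-break if that matters for your data.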
