Loading .txt file as a dict, but exclude commented rows - python

I have some data in a txt file and I would like to load it into a list of dicts. I would normally use csv.DictReader(open('file')), but this data does not have the key values in the first row. Instead, it has a number of rows commented out before the data actually begins. Also, the commented rows are not always at the beginning of the file; they can appear at the end instead.
However, all lines should always have the same fields, and I guess I could hard-code these field names (or key values) since they shouldn't change.
Sample Data
# twitter data
# retrieved at: 07.08.2014
# total number of records: 5
# exported by: userXYZ
# fields: date, time, username, source
10.12.2013; 02:00; tweeterA; web
10.12.2013; 02:01; tweeterB; iPhone
10.13.2013; 02:04; tweeterC; android
10.13.2013; 02:08; tweeterC; web
10.13.2013; 02:10; tweeterD; iPhone
Below is what I've been able to figure out so far, but I need some help getting it worked out.
My Code
header = ['date', 'time', 'username', 'source']
data = []
for line in open('data.txt'):
    if not line.startswith('#'):
        data.append(line)
Desired Format
[{'date': '10.12.2013', 'time': '02:00', 'username': 'tweeterA', 'source': 'web'},
 {'date': '10.12.2013', 'time': '02:01', 'username': 'tweeterB', 'source': 'iPhone'},
 {'date': '10.13.2013', 'time': '02:04', 'username': 'tweeterC', 'source': 'android'},
 {'date': '10.13.2013', 'time': '02:08', 'username': 'tweeterC', 'source': 'web'},
 {'date': '10.13.2013', 'time': '02:10', 'username': 'tweeterD', 'source': 'iPhone'}]

If you want a list of dicts where each dict corresponds to a row, try this:
list_of_dicts = [dict(zip(header, line.strip().split('; ')))
                 for line in open('abcd.txt')
                 if not line.strip().startswith('#')]

data = []
for line in open('data.txt'):
    if not line.startswith('#'):
        data.append(line.split("; "))
at least, assuming I understand you correctly.
Or, more succinctly:
data = [line.split("; ") for line in open("data.txt") if not line.strip().startswith("#")]
list_of_dicts = map(lambda row: dict(zip(header, row)), data)
Depending on your version of Python, you may get an iterator back from map, in which case just do:
list_of_dicts = list(map(lambda row: dict(zip(header, row)), data))
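If it helps, csv.DictReader can also be pointed at a pre-filtered iterator of lines, since it accepts any iterable of strings, not just a file object. A minimal sketch, assuming the semicolon-delimited layout above (fieldnames are supplied by hand because the file has no header row):
import csv

header = ['date', 'time', 'username', 'source']
with open('data.txt') as f:
    # generator that drops the commented rows before csv sees them
    rows = (line for line in f if not line.lstrip().startswith('#'))
    reader = csv.DictReader(rows, fieldnames=header,
                            delimiter=';', skipinitialspace=True)
    list_of_dicts = list(reader)
skipinitialspace=True absorbs the blank after each semicolon, so the values come out without leading spaces.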

Related

Parsing a zipcode boundary geojson file based on a condition, then appending to a new json file if the condition is met

I have a geojson file of zipcode boundaries.
with open('zip_geo.json') as f:
    gj = geojson.load(f)
gj['features'][0]['properties']
prints out:
{'STATEFP10': '36',
'ZCTA5CE10': '12205',
'GEOID10': '3612205',
'CLASSFP10': 'B5',
'MTFCC10': 'G6350',
'FUNCSTAT10': 'S',
'ALAND10': 40906445,
'AWATER10': 243508,
'INTPTLAT10': '+42.7187855',
'INTPTLON10': '-073.8292399',
'PARTFLG10': 'N'}
I also have a pandas dataframe with one of the fields being the zipcode.
I want to create a new geojson file containing only the features whose 'ZCTA5CE10' value is present in the zipcode column of my dataframe.
How would I go about doing this?
I was thinking of something like this (pseudo code):
new_dict = {}
for index, item in enumerate(gj):
    if item['features'][index]['properties']['ZCTA5CE10'] in df['zipcode']:
        new_dict += item
The syntax of the code above is obviously wrong, but I am not sure how to parse though multiple nested dictionaries and append the results to a new dictionary.
Link to the geojson file : https://github.com/OpenDataDE/State-zip-code-GeoJSON/blob/master/ny_new_york_zip_codes_geo.min.json
In short I want to remove all the elements relating to the zipcodes that are not there in the zipcode column in my dataframe.
Try this; just update your ziplist. Then you can save the new json to a local file.
import requests
import json

ziplist = ['12205', '14719', '12193', '12721']  # list of zips in your dataframe
url = 'https://github.com/OpenDataDE/State-zip-code-GeoJSON/raw/master/ny_new_york_zip_codes_geo.min.json'
gj = requests.get(url).json()

inziplist = []
for ft in gj['features']:
    if ft['properties']['ZCTA5CE10'] in ziplist:
        print(ft['properties']['ZCTA5CE10'])
        inziplist.append(ft)

print(len(inziplist))

new_zip_json = {}
new_zip_json['type'] = 'FeatureCollection'
new_zip_json['features'] = inziplist
new_zip_json = json.dumps(new_zip_json)
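As a sketch of the remaining pieces: ziplist can be built from the dataframe's zipcode column (the column name comes from the question) before running the loop above, and the resulting string can be written straight to disk; the output filename here is illustrative.
ziplist = df['zipcode'].astype(str).unique().tolist()  # zips present in the dataframe

with open('filtered_zips.geojson', 'w') as out:
    out.write(new_zip_json)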

How do I count the distinct fields from a given csv/psv hybrid text file?

I believe Python is the best choice, but I could be wrong.
Below is a sample from a data source in text format in Linux:
TUI,39832020:09:01,10.56| TUI,39832020:10:53,11.23| TUI,39832020:15:40,23.20
DIAN,39832020:09:04,11.56| TUI,39832020:11:45,11.23| DIAN,39832020:12:30,23.20| SLD,39832020:11:45,11.22
The size is unknown, let's presume a million rows.
Each line contains three or more sets delimited by |, and each set has fields separated by ,.
The first field in each set is the product ID. For example, in the sample above, TUI, DIAN, and SLD are product IDs.
I need to find out how many types of products I have in the file. For example, the first line contains one (TUI), and the second line contains three (DIAN, TUI, and SLD).
In total, across those two lines, there are three unique products.
Can anyone help?
Thank you very much. Any insight is appreciated.
UPDATE
I would prefer a solution based on Python with Spark, i.e. PySpark.
I'm also looking for statistics like:
total amount of each product;
all records for a given time (the second field in each set, like 39832020:09:01);
minimum and maximum price for each product.
UPDATE 2
Thank you all for the code, I really appreciate it. I wonder if anyone can load the data into an RDD and/or a dataframe; I know that in SparkSQL it is very simple to obtain those statistics.
Thanks a lot in advance.
Similar to Accdias' answer: use a dictionary, read your file line by line, split the data by | and then by ,, and total up the counts in your dictionary.
myFile="lines_to_read.txt"
productCounts = dict()
with open(myFile, 'r') as linesToRead:
for thisLine in linesToRead:
for myItem in thisLine.split("|"):
productCode=myItem.split(",")
productCode=productCode[0].strip()
if productCode in productCounts:
productCounts[productCode]+=1
else:
productCounts[productCode]=1
print(productCounts)
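If it helps, collections.Counter can do the same tally in one pass; a sketch under the same file-layout assumptions:
from collections import Counter

with open("lines_to_read.txt") as f:
    productCounts = Counter(item.split(",")[0].strip()
                            for line in f
                            for item in line.split("|"))

print(productCounts)       # count per product
print(len(productCounts))  # number of unique products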
**** Update ****
Dataframe use with Pandas, so that we can query stats on the data afterwards:
import pandas as pd

myFile = "lines_to_read.txt"
rows = []
with open(myFile, 'r') as linesToRead:
    for thisLine in linesToRead:
        for myItem in thisLine.split("|"):
            thisItem = myItem.strip().split(",")
            rows.append({'prodID': thisItem[0].strip(),
                         'timeStamp': thisItem[1].strip(),
                         'prodPrice': float(thisItem[2])})

# Build the frame once from the list of dicts; this is much faster than
# calling DataFrame.append per row (which was removed in pandas 2.0), and
# prodPrice is numeric so min/max compare prices rather than strings.
myData = pd.DataFrame(rows, columns=['prodID', 'timeStamp', 'prodPrice'])

print(myData)                                                       # full table
print(myData.groupby('prodID').agg({'prodID': 'count'}))            # total of prodIDs
print(myData.loc[myData['timeStamp'] == '39832020:11:45'])          # all lines where time = 39832020:11:45
print(myData.groupby('prodID').agg({'prodPrice': ['min', 'max']}))  # min/max prices
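Since UPDATE 2 asks for an RDD/dataframe, here is a minimal PySpark sketch under the same assumptions about the file layout (a local SparkSession; the column names prodID/timeStamp/prodPrice are carried over from the pandas version above, not fixed by the question):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("products").getOrCreate()

# one row per line -> one row per "|"-delimited set
raw = spark.read.text("lines_to_read.txt")
sets = raw.select(F.explode(F.split(F.col("value"), r"\|")).alias("item"))

# split each set on "," into typed columns
parts = F.split(F.trim(F.col("item")), ",")
df = sets.select(
    parts.getItem(0).alias("prodID"),
    parts.getItem(1).alias("timeStamp"),
    parts.getItem(2).cast("double").alias("prodPrice"),
)

print(df.select("prodID").distinct().count())             # unique products
df.groupBy("prodID").count().show()                       # total per product
df.filter(F.col("timeStamp") == "39832020:09:01").show()  # records at a given time
df.groupBy("prodID").agg(F.min("prodPrice"),
                         F.max("prodPrice")).show()       # price range per product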

Looking for a better way do accomplish dataframe to dictionary by series

Here's a portion of what the Excel file looks like. I meant to include this the first time. Thanks for the help so far.
Name Phone Number Carrier
FirstName LastName1 3410142531 Alltel
FirstName LastName2 2437201754 AT&T
FirstName LastName3 9247224091 Boost Mobile
FirstName LastName4 6548310018 Cricket Wireless
FirstName LastName5 8811620411 Project Fi
I am converting a list of names, phone numbers, and carriers to a dictionary for easy reference by other code. The idea is separate code will be able to call a name and access that person's phone number and carrier.
I got the output I need, but I'm wondering whether there's an easier way to accomplish this task and get the same output. Though it's fairly concise, I'm interested in any module or built-in I'm not aware of. My Python skills are beginner at best. I wrote this in Thonny with Python 3.6.4. Thanks!
# Imports
import pandas as pd
import math

# Assign spreadsheet filename to `file`
file = 'Phone_Numbers.xlsx'

# Load spreadsheet
xl = pd.ExcelFile(file)

# Load a sheet into a DataFrame by name: df1
df1 = xl.parse('Sheet1', header=0)

# Put the dataframe into a dictionary to start
phone_numbers = df1.to_dict(orient='records')

# Converts PhoneNumbers.xlsx to a dictionary
x = 0
temp_dict = {}
for item in phone_numbers:
    temp_list = []
    for key in phone_numbers[x]:
        tempholder = phone_numbers[x][key]
        # Checks for blanks and for phone numbers that come up as floats
        if isinstance(tempholder, (float, int)) and not math.isnan(tempholder):
            # Converts any floats to strings for use in later code
            tempholder = str(int(tempholder))
        temp_list.append(tempholder)
    # Makes the first item in the list the key and adds the rest as values
    temp_dict[temp_list[0]] = temp_list[1:]
    x += 1
print(temp_dict)
Here's the desired output:
{'FirstName LastName1': ['3410142531', 'Alltel'], 'FirstName LastName2': ['2437201754', 'AT&T'], 'FirstName LastName3': ['9247224091', 'Boost Mobile'], 'FirstName LastName4': ['6548310018', 'Cricket Wireless'], 'FirstName LastName5': ['8811620411', 'Project Fi']}
One way to do it would be to iterate through the dataframe and use a dictionary comprehension:
temp_dict = {row['Name']:[row['Phone Number'], row['Carrier']] for _, row in df.iterrows()}
where df is your original dataframe (the result of xl.parse('Sheet1', header=0)). This basically iterates through all rows in your dataframe, creating a dictionary key for each Name, with phone number and carrier as its values (in a list), as you indicated in your output.
To make sure that your phone number is not null (as you did in your loop), you could add an if clause to your dict comprehension, such as this:
temp_dict = {row['Name']: [row['Phone Number'], row['Carrier']]
             for _, row in df.iterrows()
             if not math.isnan(row['Phone Number'])}
df.set_index('Name').T.to_dict('list')
should do the job. Here, df is your dataframe.
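For completeness, a small sketch combining that one-liner with the phone-number cleanup from the original loop (it assumes the Name / Phone Number / Carrier columns shown above):
import pandas as pd

df = pd.read_excel('Phone_Numbers.xlsx', sheet_name='Sheet1')
df = df.dropna(subset=['Phone Number'])  # drop blanks, as the loop did
df['Phone Number'] = df['Phone Number'].astype('int64').astype(str)  # float -> '3410142531'
temp_dict = df.set_index('Name').T.to_dict('list')
print(temp_dict)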

Read data in chunks and keep one row for each ID in Python

Imagine we have a big file with rows as follows
ID value string
1 105 abc
1 98 edg
1 100 aoafsk
2 160 oemd
2 150 adsf
...
Say the file is named file.txt and is separated by tab.
I want to keep the largest value for each ID. The expected output is
ID value string
1 105 abc
2 160 oemd
...
How can I read it by chunks and process the data? If I read the data in chunks, how can I make sure at the end of each chunk the records are complete for each ID?
Keep track of the data in a dictionary of this format:
data = {
ID: [value, 'string'],
}
As you read each line from the file, see if that ID is already in the dict. If not, add it; if it is and the new value is bigger, replace it in the dict.
At the end, your dict will hold the biggest value for every ID.
# init to empty dict
data = {}

# open the input file
with open('file.txt', 'r') as fp:
    next(fp)  # skip the header row
    # read each line
    for line in fp:
        # grab ID, value, string
        item_id, item_value, item_string = line.split()
        # convert ID and value to integers
        item_id = int(item_id)
        item_value = int(item_value)
        # if ID is not in the dict at all, or if the value we just read
        # is bigger, use the current values
        if item_id not in data or item_value > data[item_id][0]:
            data[item_id] = [item_value, item_string]

for item_id in data:
    print(item_id, data[item_id][0], data[item_id][1])
Dictionaries don't enforce any specific ordering of their contents (at least prior to Python 3.7, where insertion order became guaranteed), so at the end of your program, when you get the data back out of the dict, it might not be in the same order as the original file (i.e. you might see ID 2 first, followed by ID 1).
If this matters to you, you can use an OrderedDict, which retains the original insertion order of the elements.
(Did you have something specific in mind when you said "read by chunks"? If you meant a specific number of bytes, then you might run into issues if a chunk boundary happens to fall in the middle of a word...)
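If "chunks" meant pandas-style chunked reading, here is a hedged sketch of the same max-per-ID idea using read_csv's chunksize (the tab separator comes from the question; the chunk size is illustrative):
import pandas as pd

best = {}  # ID -> (value, string)
for chunk in pd.read_csv('file.txt', sep='\t', chunksize=100000):
    for row in chunk.itertuples(index=False):
        if row.ID not in best or row.value > best[row.ID][0]:
            best[row.ID] = (row.value, row.string)

for item_id, (value, string) in best.items():
    print(item_id, value, string)
Because the running best values are kept in the dict across chunks, it doesn't matter where a chunk boundary falls relative to an ID's rows.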
Code
import csv
import itertools as it

with open("test.csv") as f:
    reader = csv.DictReader(f, delimiter=" ")           # 1
    for k, g in it.groupby(reader, lambda d: d["ID"]):  # 2
        print(max(g, key=lambda d: float(d["value"])))  # 3

# {'value': '105', 'string': 'abc', 'ID': '1'}
# {'value': '160', 'string': 'oemd', 'ID': '2'}
Details
The with block ensures safe opening and closing of the file f. The file is iterable, allowing you to loop over it or, ideally, apply itertools.
For each line of f, csv.DictReader splits the data and maintains the header-row information as key-value pairs of a dictionary, e.g. [{'value': '105', 'string': 'abc', 'ID': '1'}, ...].
This data is iterable and is passed to groupby, which chunks all of the data by ID. See this post for more details on how groupby works.
The max() builtin, combined with a special key function, returns the dicts with the largest "value". See this tutorial for more details on the max() function.

Appending to a list from an iterated dictionary

I've written a script that I'm sure can be condensed. What I'm trying to achieve is an automatic version of this:
file1 = tkFileDialog.askopenfilename(title='Select the first data file')
file2 = tkFileDialog.askopenfilename(title='Select the second data file')
TurnDatabase = tkFileDialog.askopenfilename(title='Select the turn database file')
headers = pd.read_csv(file1, nrows=1).columns
data1 = pd.read_csv(file1)
data2 = pd.read_csv(file2)
This is how the data is collected.
There's many more lines of code which focus on picking out bits of the data. I'm not going to post it all.
This is what I'm trying to condense:
EntrySummary = []
for key in Entries1.viewkeys():
    MeanFRH = Entries1[key].hRideF.mean()
    MeanFRHC = Entries1[key].hRideFCalc.mean()
    MeanRRH = Entries1[key].hRideR.mean()
    # There's 30 more lines of these...
    # Then the list is updated with this:
    EntrySummary.append({'Turn Number': key, 'Avg FRH': MeanFRH, 'Avg FRHC': MeanFRHC, 'Avg RRH': MeanRRH, ...})  # and so on
EntrySummary = pd.DataFrame(EntrySummary)
EntrySummary.index = EntrySummary['Turn Number']
del EntrySummary['Turn Number']
This is the old code. What I've tried to do is this:
EntrySummary = []
for i in headers():
    EntrySummary.append({'Turn Number': key, str('Avg '[i]): str('Mean'[i])})
print EntrySummary
# The print is only there for me to see if it's worked.
However I'm getting this error at the minute:
for i in headers():
TypeError: 'Index' object is not callable
Any ideas as to where I've made a mistake? I've probably made a few...
Thank you in advance
Oli
If I'm understanding your situation correctly, you want to replace the long series of assignments in the "old code" you've shown with another loop that processes all of the different items automatically using the list of headers from your data files.
I think this is what you want:
EntrySummary = []
for key, value in Entries1.viewitems():
    entry = {"Turn Number": key}
    for header in headers:
        entry["Avg {}".format(header)] = getattr(value, header).mean()
    EntrySummary.append(entry)
You might be able to come up with some better variables names, since you know what the keys and values in Entries1 are (I do not, so I used generic names).
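If it helps, the rebuilt EntrySummary list can then feed the same DataFrame post-processing as your original code; a sketch repeating those steps:
EntrySummary = pd.DataFrame(EntrySummary)
EntrySummary = EntrySummary.set_index('Turn Number')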
