Selecting columns based on external list/data in python - python

I have a data set with various Region map variables(around 1000). Sample data looks like:
Userid regionmap1 regionmap2 regionmap3 and so on.
78 7 na na
45 na na na
67 1 na na
Here the numbers in the regionmap variables represent the number of views. I also have an external file with only 10 region map entries, one per row, naming 10 different region map variables:
Regionmap1
Regionmap3
Regionmap7
.....
.....
Regionmap856.
So my task is to keep only these regionmap variables as columns in the original file and delete all the other 990 columns. So the final data should look like:
Userid Regionmap1 regionmap3 regionmap7 ........ regionmap856
78 7 na na na
45 na na na na
67 1 na na na
It would be great if anyone can provide me help in this regard in Python.

This is pretty easy to do. What have you tried?
Here's a general procedure to help you get started:
1 - open the smaller file w/ the regionmaps you want to keep and readline those into a list.
2 - open the larger file and create a dictionary of lists to contain the data. You can think of the dict's keys as basically column headers. The values are lists that represent the column values for all your records.
3 - now, remove kvps from your dict where the key is not in your list from step 1 or is not userid.
4 - use resulting dict to write out a new file.
Definitely not the only approach, but it's a simple one that you should be able to start with. Hope that helps :)
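The four steps above can be sketched roughly as follows. This is only a sketch under some assumptions: the data file is whitespace-separated with a single header line, the keep-list file has one column name per line, and names are matched case-insensitively (the sample mixes `Regionmap1` and `regionmap1`). The function name `keep_columns` is made up for illustration, and instead of building a dictionary of lists it filters each row as it is read, so the 1000 columns never have to sit in memory at once.

```python
import io

def keep_columns(data_file, keep_file, id_column="userid"):
    """Yield rows (lists of strings) restricted to the id column and the listed columns."""
    # Step 1: read the names to keep, one per line, compared case-insensitively.
    wanted = {line.strip().lower() for line in keep_file if line.strip()}
    wanted.add(id_column.lower())

    # Step 2: the first line of the data file is the header.
    header = data_file.readline().split()
    idx = [i for i, name in enumerate(header) if name.lower() in wanted]

    # Steps 3 and 4: emit the header, then every row, keeping only the chosen positions.
    yield [header[i] for i in idx]
    for line in data_file:
        fields = line.split()
        if fields:
            yield [fields[i] for i in idx]

# Tiny in-memory stand-ins for the two files (assumed contents, taken from the sample).
data = io.StringIO("Userid regionmap1 regionmap2 regionmap3\n78 7 na na\n45 na na na\n67 1 na na\n")
keep = io.StringIO("Regionmap1\nRegionmap3\n")
rows = list(keep_columns(data, keep))
for row in rows:
    print(" ".join(row))
```

With real files, the two `io.StringIO` objects would be replaced by `open(...)` handles, and each yielded row written to the output file.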

I have a solution adapted for your problem.
You can adjust the output formatting to make the file look better.
import io
import numpy as np
# Preparing an object that simulates a file (f is the file)
f = io.StringIO("""Userid regionmap1 regionmap2 regionmap3
78 7 na na
45 na na na
67 1 na na""")
# Reading file and getting the header (1st line)
head = f.readline().split()
data = []
for a in f:
    # 'na' is not a valid float literal, 'NaN' is
    data.append([float(e) for e in a.replace('na', 'NaN').split()])
data = np.array(data)
# Columns to keep (lower-cased to match the header of the data file)
s = ("Regionmap1", "Regionmap3")
s = ["Userid"] + [e.lower() for e in s]
# Index of the columns to keep
idx, = np.where([e in s for e in head])
# Saving the new data in a file (simulated with StringIO)
ff = io.StringIO()
# Take the header names from the file so they stay aligned with the data columns
ff.write(' '.join(head[i] for i in idx) + '\n')
np.savetxt(ff, data[:, idx])
The rendered file looks like:
Userid regionmap1 regionmap3
7.800000000000000000e+01 7.000000000000000000e+00 nan
4.500000000000000000e+01 nan nan
6.700000000000000000e+01 1.000000000000000000e+00 nan

Try this! This code builds a dictionary with the headers as keys and the lists of column values as values.
with open('2.txt', 'r') as f:  # opening the large file
    data = f.readlines()
# assuming that the large file is tab separated, and the first line is the header line
hdrs = data[0].rstrip('\n').split('\t')
data_dict = {}  # main data
for each_line in data[1:]:  # starting from the second line as the first line is the header line
    splitdata = each_line.rstrip('\n').split('\t')  # splitting the line on tabs
    for i, d in enumerate(splitdata):
        # appending the column value to the list for its respective header
        data_dict.setdefault(hdrs[i], []).append(d)
for k, v in data_dict.items():  # printing the final data dict
    print(k, v)

Related

Blank column appearing in .csv output, how can I remove it?

*Updated to add more lines of input file
I have a .csv file with header and subsequent data as follows (shown only first few rows here):
gene_name VarXCRep.1 VarX1Rep.1 VarX2Rep.1 VarXCRep.2 VarX3Rep.2 VarX1Rep.2 VarX2Rep.2 VarXCRep.3 VarX3Rep.3 VarX1Rep.3 VarX2Rep.3
1 Soltu.DM.01G000010 360.7000522 395.2279977 323.2595994 361.5910696 327.7380499 386.8290979 336.3997167 333.0843759 317.4954424 377.756613 396.666783
2 Soltu.DM.01G000020 91.12422371 69.30538348 77.36127164 135.060696 61.85252412 110.6099 68.21624475 108.7053612 55.31681029 56.52040232 36.14709293
3 Soltu.DM.01G000030 439.1681337 183.5656103 232.0838149 579.546161 220.9018719 179.6646995 179.2348391 291.2746216 222.4196747 266.8621527 208.321404
4 Soltu.DM.01G000040 268.3102142 185.4387288 192.0217278 301.5640936 130.9345641 237.108515 203.9799475 236.921941 92.19468382 198.1791322 38.04957151
5 Soltu.DM.01G000050 341.7158389 479.5183289 504.229717 322.2876925 528.5579334 390.4957244 470.1570594 342.8399852 554.3205365 424.9761896 634.4766049
6 Soltu.DM.01G000060 468.2772607 839.1570756 759.7982036 514.516937 886.0173261 572.6048416 579.8380803 549.1014398 1011.836655 598.8300854 1077.754113
7 Soltu.DM.01G000070 2.531228436 0 5.525805117 1.429213714 8.032795341 1.83331326 5.350293706 0 4.609734191 0 7.609914302
8 Soltu.DM.01G000090 84.79615262 54.3204357 75.97982036 98.61574626 102.0165008 83.11020113 84.26712586 108.7053612 98.53306833 80.13019064 93.2214502
9 Soltu.DM.01G000100 67.07755356 73.05162042 12.43306151 118.6247383 6.426236273 77.61026135 36.11448251 97.55609336 8.643251608 67.25212429 15.2198286
10 Soltu.DM.01G000110 1.265614218 0 1.381451279 2.143820571 0 1.22220884 4.012720279 0 2.304867095 0.715448131 0.951239288
11 Soltu.DM.01G000120 821.3836276 451.4215518 846.8296342 820.3686718 737.4106123 497.4389979 835.9833915 798.5663071 752.5391067 704.7164087 532.6940011
12 Soltu.DM.01G000130 2.531228436 3.746236945 5.525805117 2.143820571 0.803279534 0.61110442 2.00636014 1.393658477 1.728650322 2.146344392 10.46363217
13 Soltu.DM.01G000140 93.65545214 127.3720561 102.2273947 105.7618148 104.4263394 108.7765868 115.7001014 98.94975183 108.9049703 110.8944603 126.5148253
14 Soltu.DM.01G000150 112.6396654 84.29033126 91.17578444 86.46742969 154.2296705 99.61002047 111.0185944 115.6736536 111.7860541 115.187149 163.6131575
15 Soltu.DM.01G000160 644.197637 573.1742525 222.413656 760.3416958 178.3280566 761.4361074 594.551388 1053.605808 222.4196747 585.2365709 303.4453328
16 Soltu.DM.01G000170 751.7748456 841.0301941 910.3763931 773.9192261 835.4107154 820.7132361 1148.975573 804.140941 849.3435247 710.4399938 946.4830913
17 Soltu.DM.01G000190 6.328071091 1.873118472 5.525805117 6.431461713 8.836074875 5.49993978 8.694227272 11.14926781 4.609734191 7.869929438 0.951239288
18 Soltu.DM.01G000200 88.59299527 73.05162042 66.30966141 74.31911313 63.45908319 78.83247019 74.23532517 86.40682554 59.35032771 59.38219485 44.70824652
19 Soltu.DM.01G000210 108.8428228 112.3871083 85.64997932 111.4786697 73.0984376 123.4430928 113.6937412 143.5468231 67.41736254 77.26839812 86.56277518
20 Soltu.DM.01G000220 5.062456873 86.16344973 93.938687 20.72359885 507.6726655 30.555221 24.74510839 6.968292383 551.4394526 54.37405793 920.7996305
This is how the file appears in Bash shell
gene_name,VarXCRep.1,VarX1Rep.1,VarX2Rep.1,VarXCRep.2,VarX3Rep.2,VarX1Rep.2,VarX2Rep.2,VarXCRep.3,VarX3Rep.3,VarX1Rep.3,VarX2Rep.3
Soltu.DM.01G000010,360.7000522,395.2279977,323.2595994,361.5910696,327.7380499,386.8290979,336.3997167,333.0843759,317.4954424,377.756613,396.666783
Soltu.DM.01G000020,91.12422371,69.30538348,77.36127164,135.060696,61.85252412,110.6099,68.21624475,108.7053612,55.31681029,56.52040232,36.14709293
Soltu.DM.01G000030,439.1681337,183.5656103,232.0838149,579.546161,220.9018719,179.6646995,179.2348391,291.2746216,222.4196747,266.8621527,208.321404
Soltu.DM.01G000040,268.3102142,185.4387288,192.0217278,301.5640936,130.9345641,237.108515,203.9799475,236.921941,92.19468382,198.1791322,38.04957151
Soltu.DM.01G000050,341.7158389,479.5183289,504.229717,322.2876925,528.5579334,390.4957244,470.1570594,342.8399852,554.3205365,424.9761896,634.4766049
Soltu.DM.01G000060,468.2772607,839.1570756,759.7982036,514.516937,886.0173261,572.6048416,579.8380803,549.1014398,1011.836655,598.8300854,1077.754113
Soltu.DM.01G000070,2.531228436,0,5.525805117,1.429213714,8.032795341,1.83331326,5.350293706,0,4.609734191,0,7.609914302
Soltu.DM.01G000090,84.79615262,54.3204357,75.97982036,98.61574626,102.0165008,83.11020113,84.26712586,108.7053612,98.53306833,80.13019064,93.2214502
Soltu.DM.01G000100,67.07755356,73.05162042,12.43306151,118.6247383,6.426236273,77.61026135,36.11448251,97.55609336,8.643251608,67.25212429,15.2198286
I was asked to remove various types of columns and associated data which I have done successfully in the following code. I was then asked to arrange the data such that the headers show control (VarXC) repeats 1, 2 and 3 and experiment 1 (VarX1) repeats in columns next to each other which also has been done in the following code:
empty_list = []
for ln in open("FinalXVartest.csv").readlines():
    col = ln.split(",")
    del col[3]
    del col[4]
    del col[5]
    del col[6]
    del col[7]
    col.append(col.pop(2))
    col.append(col.pop(3))
    col.append(col.pop(4))
    empty_list += col
    empty_list += '\n'
file_out = open("Xtest_2Var.csv", "w")
file_out.write(','.join(empty_list))
file_out.close()
When I run this code, the output shows up like this (screenshot of the final output, not reproduced here):
I am not sure how I am getting that space on the left side. Can someone help me remove so that all the rows shift by one cell to the left?
You should change the code a little bit to make it work as you expect. The problem with your code is that you are constructing a single list to which you add EOL \n as elements. Therefore, when you write this list to a file
file_out.write(','.join(empty_list))
there will be a comma after each line break. I construct a list of lists and add \n right after join to avoid your problem:
empty_list = []
for ln in open("files/FinalXVartest.csv").readlines():
    col = ln.split(",")
    del col[3]
    del col[4]
    del col[5]
    del col[6]
    del col[7]
    col.append(col.pop(2))
    col.append(col.pop(3))
    col.append(col.pop(4))
    empty_list.append(col)
file_out = open("files/Xtest_2Var.csv", "w")
for item in empty_list:
    file_out.write(','.join(item) + '\n')
file_out.close()
But it's better to use csv library. It is suitable for reading and writing csv files.
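As a rough sketch of what a csv-based version could look like, the columns can be picked by header name instead of being deleted by position. It assumes the header names shown in the question; `select_columns` and the `wanted` list are illustrative names, not part of the original answer.

```python
import csv
import io

def select_columns(src, dst, wanted):
    """Copy CSV rows from src to dst, keeping only the `wanted` columns, in that order."""
    reader = csv.DictReader(src)
    # extrasaction="ignore" silently drops every column that is not in `wanted`.
    writer = csv.DictWriter(dst, fieldnames=wanted, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)

# Control repeats first, then experiment 1 repeats.
wanted = ["gene_name", "VarXCRep.1", "VarXCRep.2", "VarXCRep.3",
          "VarX1Rep.1", "VarX1Rep.2", "VarX1Rep.3"]

# One row of the sample data, held in memory instead of a file.
src = io.StringIO(
    "gene_name,VarXCRep.1,VarX1Rep.1,VarX2Rep.1,VarXCRep.2,VarX3Rep.2,VarX1Rep.2,"
    "VarX2Rep.2,VarXCRep.3,VarX3Rep.3,VarX1Rep.3,VarX2Rep.3\n"
    "Soltu.DM.01G000070,2.531228436,0,5.525805117,1.429213714,8.032795341,1.83331326,"
    "5.350293706,0,4.609734191,0,7.609914302\n"
)
dst = io.StringIO()
select_columns(src, dst, wanted)
print(dst.getvalue())
```

With real files, both would be opened with `newline=""`, as the csv module documentation recommends. Because the writer emits exactly one field per name in `wanted`, no stray leading comma (and so no blank first column) can appear.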
Using pandas:
import pandas as pd
import re
df = pd.read_csv('FinalXVartest.csv', index_col='gene_name')
parsed = sorted([(re.match(r'VarX(.)Rep\.(\d)', k).groups()[::-1], k) for k in df.columns])
cols = [k for (i, j), k in parsed if j in {'1', 'C'}]
df[cols].to_csv('Xtest_2Var.csv')
>>> df[cols]
VarX1Rep.1 VarXCRep.1 VarX1Rep.2 VarXCRep.2 VarX1Rep.3 VarXCRep.3
gene_name
Soltu.DM.01G000010 395.227998 360.700052 386.829098 361.591070 377.756613 333.084376
Soltu.DM.01G000020 69.305383 91.124224 110.609900 135.060696 56.520402 108.705361
Soltu.DM.01G000030 183.565610 439.168134 179.664700 579.546161 266.862153 291.274622
Soltu.DM.01G000040 185.438729 268.310214 237.108515 301.564094 198.179132 236.921941
Soltu.DM.01G000050 479.518329 341.715839 390.495724 322.287692 424.976190 342.839985
Soltu.DM.01G000060 839.157076 468.277261 572.604842 514.516937 598.830085 549.101440
Soltu.DM.01G000070 0.000000 2.531228 1.833313 1.429214 0.000000 0.000000
Soltu.DM.01G000090 54.320436 84.796153 83.110201 98.615746 80.130191 108.705361
Soltu.DM.01G000100 73.051620 67.077554 77.610261 118.624738 67.252124 97.556093

How to read space delimited data, two row types, no fixed width and plenty of missing values?

There's lots of good information out there on how to read space-delimited data with missing values if the data is fixed-width.
http://jonathansoma.com/lede/foundations-2017/pandas/opening-fixed-width-files/
Reading space delimited file in Python/Pandas with missing values
ASCII table with consecutive white-spaces as separators and missing data python pandas
I'm currently trying to read Japan's Meteorological Agency typhoon history data which is supposed to have this format, but doesn't actually:
# Header rows:
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|
AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII
# Data rows:
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|
AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P
It is very similar to NOAA's hurricane best track data, except that NOAA's data is comma delimited and its missing values are given as -999 or NaN, which simplifies reading the data. Additionally, Japan's data doesn't actually follow the advertised format. For example, column FFFF in the data rows doesn't always have width 4. Sometimes it has width 3.
I must say that I'm at a complete loss as how to process this data into a dataframe. I've investigated the pd.read_fwf method, and it initially looked promising until I discovered the malformed columns and the two different row types.
My question:
How can I approach cleaning this data and getting it into a dataframe? I'd just find a different dataset, but honestly I can't find any comprehensive typhoon data anywhere else.
I went a little deep for you here, because I'm assuming you're doing this in the name of science, and if I can help someone trying to understand climate change then it's a good cause.
After looking the data over, I noticed the issue is that the data is stored in a de-normalized structure. Off the top of my head there are 2 ways you can approach this. Rewriting the file into another file to load into pandas or Dask is what I'll show, since that's probably the easiest way to think about it (but certainly not the most efficient, for those that will inevitably roast me in the comments).
Think of this as two separate tables with a 1-to-many relationship: one table for typhoons and another for the data belonging to a given typhoon.
A decent, but not really efficient way would be to rewrite it to a better nested structure, like JSON. Then load the data in using that. Note the 2 distinct types of columns.
Step 1: map the data out
There are really 2 tables in one table here. Each typhoon is going to show up as a row that appears like this:
66666 9119 150 0045 9119 0 6 MIRREILE 19920701
While the records for that typhoon are going to follow that row (think of each of these as a separate row):
20080100 002 3 178 1107 994 035 00000 0000 30600 0200
Load the File in, reading it as raw lines. By using the .readlines() method, we can read each individual line in as an item in a list.
# load the file as raw input
with open('./test.txt') as f:
    lines = f.readlines()
Now that we have that read in, we're going to need some logic to separate one kind of line from the other. It appears that every time there is a typhoon record, the line starts with '66666', so let's key off that. So, given that we look at each individual line in a horribly inefficient loop, we can write some if/else logic to have a look:
if row[:5] == '66666':
    pass  # do stuff
else:
    pass  # do other stuff
That's going to be a pretty solid way to separate that logic for now, which will be useful to guide splitting that up. Now, we need to write a loop that will check that for each row:
# initialize list of dicts
collection = []
def write_typhoon(row: str, collection: list) -> list:
    if row[:5] == '66666':
        pass  # do stuff
    else:
        pass  # do other stuff
# read through lines list from the .readlines(), looping sequentially
for line in lines:
    write_typhoon(line, collection)
Lastly, we're going to need some logic to extract the data within the if/else branches inside the write_typhoon() function. I didn't care to do a whole lot of thinking here, and opted for the simplest I could make it: defining the fwf metadata myself. Because "yolo":
def write_typhoon(row: str, collection: list) -> list:
    if row[:5] == '66666':
        # header row: starts a new typhoon
        typhoon = {
            "AA":row[:5],
            "BB":row[6:11],
            "CC":row[12:15],
            "DD":row[16:20],
            "EE":row[21:25],
            "FF":row[26:27],
            "GG":row[28:29],
            "HH":row[30:50],
            "II":row[51:],
            "data":[]
        }
        # clean that whitespace
        for key, value in typhoon.items():
            if key != 'data':
                typhoon[key] = value.strip()
        collection.append(typhoon)
    else:
        # data row: belongs to the most recently added typhoon
        sub_data = {
            "A":row[:9],
            "B":row[9:12],
            "C":row[13:14],
            "D":row[15:18],
            "E":row[19:23],
            "F":row[24:32],
            "G":row[33:40],
            "H":row[41:42],
            "I":row[42:46],
            "J":row[47:51],
            "K":row[52:53],
            "L":row[54:57],
            "M":row[58:70],
            "P":row[71:]
        }
        # clean that whitespace
        for key, value in sub_data.items():
            sub_data[key] = value.strip()
        collection[-1]['data'].append(sub_data)
    return collection
Okay, that took me longer than I'm willing to admit. I won't lie. Gave me PTSD flashbacks from writing COBOL programs...
Anyway, now we have a nice, nested data structure in native python types. The fun can begin!
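One quick sanity check on that nested structure, sketched here under an assumption: if the third header field (`CC` above) holds the number of data lines for the typhoon, as the `n_tracks` column name used further down this page suggests, then it should equal the length of each `data` list. The helper name `find_count_mismatches` and the sample records are made up for this illustration.

```python
def find_count_mismatches(collection):
    """Return (BB field, declared count, parsed count) for typhoons whose counts disagree."""
    mismatches = []
    for typhoon in collection:
        try:
            declared = int(typhoon["CC"])
        except ValueError:
            declared = None  # field was empty or mis-sliced
        parsed = len(typhoon["data"])
        if declared != parsed:
            mismatches.append((typhoon["BB"], declared, parsed))
    return mismatches

# Made-up records in the same shape that write_typhoon() produces.
sample = [
    {"BB": "0001", "CC": "002", "data": [{"A": "00000100"}, {"A": "00000106"}]},
    {"BB": "0002", "CC": "003", "data": [{"A": "00000200"}]},
]
mismatches = find_count_mismatches(sample)
print(mismatches)
```

Any typhoon reported here is a hint that a header or data row was mis-sliced, which matters for a file that does not strictly follow its advertised widths.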
Step 2: Load this into a usable format
To analyze it, I'm assuming you'll want it in pandas (or maybe Dask if it's too big). Here is what I was able to come up with along that front:
import pandas as pd
df = pd.json_normalize(
collection,
record_path='data',
meta=["AA","BB","CC","DD","EE","FF","GG","HH","II"]
)
A great reference for that can be found in the answers for this question (particularly the second one, not the selected one)
Put it all together now:
import pandas as pd
# load the file as raw input
with open('./test.txt') as f:
    lines = f.readlines()
# initialize list of dicts
collection = []
def write_typhoon(row: str, collection: list) -> list:
    if row[:5] == '66666':
        # header row: starts a new typhoon
        typhoon = {
            "AA":row[:5],
            "BB":row[6:11],
            "CC":row[12:15],
            "DD":row[16:20],
            "EE":row[21:25],
            "FF":row[26:27],
            "GG":row[28:29],
            "HH":row[30:50],
            "II":row[51:],
            "data":[]
        }
        for key, value in typhoon.items():
            if key != 'data':
                typhoon[key] = value.strip()
        collection.append(typhoon)
    else:
        # data row: belongs to the most recently added typhoon
        sub_data = {
            "A":row[:9],
            "B":row[9:12],
            "C":row[13:14],
            "D":row[15:18],
            "E":row[19:23],
            "F":row[24:32],
            "G":row[33:40],
            "H":row[41:42],
            "I":row[42:46],
            "J":row[47:51],
            "K":row[52:53],
            "L":row[54:57],
            "M":row[58:70],
            "P":row[71:]
        }
        for key, value in sub_data.items():
            sub_data[key] = value.strip()
        collection[-1]['data'].append(sub_data)
    return collection
# read through file sequentially
for line in lines:
    write_typhoon(line, collection)
# load to pandas df using json_normalize
df = pd.json_normalize(
    collection,
    record_path='data',
    meta=["AA","BB","CC","DD","EE","FF","GG","HH","II"]
)
print(df.head(20)) # let's see what we've got!
There's someone who might have had the same problem and created a library for it, you can check it out here:
https://github.com/miniufo/besttracks
It also includes a quickstart notebook with loading the same dataset.
Here is how I ended up doing it. The key was realizing there are two types of rows in the data, but within each type the columns are fixed width:
header_fmt = "AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII"
track_fmt = "AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P"
So, here's how it went. I wrote these two functions to help me reformat the text file into CSV format:
def get_idxs(string, char):
    idxs = []
    for i in range(len(string)):
        if string[i - 1].isalpha() and string[i] == char:
            idxs.append(i)
    return idxs
def replace(string, idx, replacement):
    string = list(string)
    try:
        for i in idx: string[i] = replacement
    except TypeError:
        string[idx] = replacement
    return ''.join(string)
# test it out
header_fmt = "AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII"
track_fmt = "AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P"
header_idxs = get_idxs(header_fmt, ' ')
track_idxs = get_idxs(track_fmt, ' ')
print(replace(header_fmt, header_idxs, ','))
print(replace(track_fmt, track_idxs, ','))
Testing the function on the format strings, we see commas were put in the appropriate places:
AAAAA,BBBB, CCC,DDDD,EEEE,F,G,HHHHHHHHHHHHHHHHHHHH, IIIIIIII
AAAAAAAA,BBB,C,DDD,EEEE,FFFF, GGG, HIIII,JJJJ,KLLLL,MMMM, P
So next apply those functions to the .txt and create a .csv file with the output:
from contextlib import ExitStack
from tqdm.notebook import tqdm
with ExitStack() as stack:
    read_file = stack.enter_context(open('data/bst_all.txt', 'r'))
    write_file = stack.enter_context(open('data/bst_all_clean.txt', 'a'))
    for line in tqdm(read_file.readlines()):
        if ' ' in line[:8]: # line is header data
            write_file.write(replace(line, header_idxs, ',') + '\n')
        else: # line is track data
            write_file.write(replace(line, track_idxs, ',') + '\n')
The next task is to add the header data to ALL rows, so that all rows have the same format:
header_cols = ['indicator', 'international_id', 'n_tracks', 'cyclone_id', 'international_id_dup',
'final_flag', 'delta_t_fin', 'name', 'last_revision']
track_cols = ['date', 'indicator', 'grade', 'latitude', 'longitude', 'pressure', 'max_wind_speed',
'dir_long50', 'long50', 'short50', 'dir_long30', 'long30', 'short30', 'jp_landfall']
data = pd.read_csv('data/bst_all_clean.txt', names=track_cols, skipinitialspace=True)
data.date = data.date.astype('string')
# Get headers. Header rows have variable 'indicator' which is 5 characters long.
headers = data[data.date.apply(len) <= 5]
data[['storm_id', 'records', 'name']] = headers.iloc[:, [1, 2, 7]]
# Rearrange columns; bring identifiers to the first three columns.
cols = list(data.columns[-3:]) + list(data.columns[:-3])
data = data[cols]
# front fill NaN's for header data
data[['storm_id', 'records', 'name']] = data[['storm_id', 'records', 'name']].ffill()
# delete now extraneous header rows
data = data.drop(headers.index)
And that yields some nicely formatted data, like this:
storm_id records name date indicator grade latitude longitude
15 5102.0 37.0 GEORGIA 51031900 2 2 67.0 1614
16 5102.0 37.0 GEORGIA 51031906 2 2 70.0 1625
17 5102.0 37.0 GEORGIA 51031912 2 2 73.0 1635

Finding the row number for the header row in a CSV file / Pandas Dataframe

I am trying to get an index or row number for the row that holds the headers in my CSV file.
The issue is, the header row can move up and down depending on the output of the report from our system (I have no control to change this)
code:
ht = pd.read_csv(file.csv)
test = ht.get_loc('Code') #Code being header im using to locate the header row
csv1 = read_csv(file.csv, header=test)
df1 = df1.append(csv1) #Appending as have many files
If I was to print test, I would expect a number around 4 or 5, and that's what I am feeding into the second read "read_csv"
The error I'm getting is that it's expecting 1 header column, but I have 26 columns. I am just trying to use the first header string to get the row number
Thanks
:-)
Edit:
CSV format
This file contains the data around the volume of items blablalbla
the deadlines for delivery of items a - z is 5 days
the deadlines for delivery of items aa through zz are 3 days
the deadlines for delivery of items aaa through zzz are 1 days
code,type,arrived_date,est_del_date
a/wrwgwr12/001,kids,12-dec-18,17-dec-18
aa/gjghgj35/030,pet,15-dec-18,18-dec-18
as you will see the "The deadlines" rows are the same, this can be 3 or 5 based on the code ids, thus the header row can change up or down.
I also did not write out all 26 column headers, not sure that matters.
Wanted DF format
index | code | type | arrived_date | est_del_date
1 | a/wrwgwr12/001 | kids | 12-dec-18 | 17-dec-18
2 | aa/gjghgj35/030 | Pet | 15-dec-18 | 18-dec-18
Hope this makes sense..
Thanks,
You can use the csv module to find the first row which contains a delimiter, then feed the index of this row as the skiprows parameter to pd.read_csv:
from io import StringIO
import csv
import pandas as pd
x = """This file contains the data around the volume of items blablalbla
the deadlines for delivery of items a - z is 5 days
the deadlines for delivery of items aa through zz are 3 days
the deadlines for delivery of items aaa through zzz are 1 days
code,type,arrived_date,est_del_date
a/wrwgwr12/001,kids,12-dec-18,17-dec-18
aa/gjghgj35/030,pet,15-dec-18,18-dec-18"""
# replace StringIO(x) with open('file.csv', 'r')
with StringIO(x) as fin:
    reader = csv.reader(fin)
    idx = next(idx for idx, row in enumerate(reader) if len(row) > 1) # 4
# replace StringIO(x) with 'file.csv'
df = pd.read_csv(StringIO(x), skiprows=idx)
print(df)
code type arrived_date est_del_date
0 a/wrwgwr12/001 kids 12-dec-18 17-dec-18
1 aa/gjghgj35/030 pet 15-dec-18 18-dec-18
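A variation on the same idea, closer to the original attempt of locating the row by its first header name: scan the raw lines for the one whose first field is `code`. This is only a sketch; `find_header_row` is a hypothetical helper, and it assumes the header row is the first line whose first field matches that name (compared case-insensitively).

```python
import csv
import io

def find_header_row(lines, first_header, delimiter=","):
    """Return the 0-based index of the first row whose first field equals first_header."""
    for idx, row in enumerate(csv.reader(lines, delimiter=delimiter)):
        if row and row[0].strip().lower() == first_header.lower():
            return idx
    raise ValueError("no row starts with %r" % first_header)

text = """This file contains the data around the volume of items blablalbla
the deadlines for delivery of items a - z is 5 days
the deadlines for delivery of items aa through zz are 3 days
the deadlines for delivery of items aaa through zzz are 1 days
code,type,arrived_date,est_del_date
a/wrwgwr12/001,kids,12-dec-18,17-dec-18
aa/gjghgj35/030,pet,15-dec-18,18-dec-18"""

header_idx = find_header_row(io.StringIO(text), "Code")
print(header_idx)  # 4
```

The returned index can then be passed as `skiprows` to `pd.read_csv`, exactly as in the answer above. Matching on the header name is a little more robust than counting delimiters if a preamble line ever happens to contain a comma.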

After getting the desired output, not able to print only the date from a date/time column

Hello experts, I have a program that reads a CSV file containing several columns. The main purpose of the program is to convert each string into a sequence number, with duplicated strings getting the same number. I can do all of that, but I want my Date/Time column to contain only the date. For that I applied slicing, which works in the console, but I'm not able to write the result to my other CSV file. Please tell me what to do.
This is the program I have written:
import pandas as pd
import csv
import os
# from io import StringIO
# tempFile="input1.csv"
with open("input1.csv", 'r',encoding="utf8") as csvfile:
    # creating a csv reader object
    reader = csv.DictReader(csvfile, delimiter=',')
    # next(reader, None)
    '''We then restructure the data to be a set of keys with list of values {key_1: [], key_2: []}:'''
    data = {}
    for row in reader:
        # print(row)
        for header, value in row.items():
            try:
                data[header].append(value)
            except KeyError:
                data[header] = [value]
    '''Next we want to give each value in each list a unique identifier.'''
    # Loop through all keys
    for key in data.keys():
        values = data[key]
        things = list(sorted(set(values), key=values.index))
        for i, x in enumerate(data[key]):
            if key=="Date/Time":
                var = data[key]
                iter_obj1 = iter(var)
                while True:
                    try:
                        element1 = next(iter_obj1)
                        date =element1[0:10]
                        print("date-",date)
                    except StopIteration:
                        break
                break
            else:
                # if key == "Date/Time" :
                #     print(x[0:10])
                #     continue
                data[key][i] = things.index(x) + 1
        print('data.[keys]()-',data[key])
        print('data.keys()-',data.keys())
        print('values-',values)
        print('data.keys()-',key)
        print('x-',x)
        print('i-',i)
        # print("FullName-",FullName)
"""Since csv.writerows() takes a list but treats it as a row, we need to restructure our
data so that each row is one value from each list. This can be accomplished using zip():"""
with open("ram3.csv", "w") as outfile:
    writer = csv.writer(outfile)
    # Write headers
    writer.writerow(data.keys())
    # Make one row equal to one value from each list
    rows = zip(*data.values())
    # Write rows
    writer.writerows(rows)
Note: I can't use a pandas DataFrame, which is why I have written the code like this. Please tell me where I need to change the code so that the Date/Time column contains only the date. Thanks.
Input:
job_Id Name Address Email Date/Time
1 snehil singh marathalli ss#gmail.com 12/10/2011:02:03:20
2 salman marathalli ss#gmail.com 12/11/2011:03:10:20
3 Amir HSR ar#gmail.com 11/02/2009:09:03:20
4 Rakhesh HSR rakesh#gmail.com 09/12/2010:02:03:55
5 Ram marathalli r#gmail.com 01/10/2014:12:03:20
6 Shyam BTM ss#gmail.com 12/11/2012:01:03:20
7 salman HSR ss#gmail.com 11/08/2016:15:03:20
8 Amir BTM ar#gmail.com 07/10/2013:04:02:30
9 snehil singh Majestic sne#gmail.com 03/03/2018:02:03:20
Csv file:
job_Id Name Address Email Date/Time
1 1 1 1 12/10/2011:02:03:20
2 2 1 1 12/11/2011:03:10:20
3 3 2 2 11/02/2009:09:03:20
4 4 2 3 09/12/2010:02:03:55
5 5 1 4 01/10/2014:12:03:20
6 6 3 1 12/11/2012:01:03:20
7 2 2 1 11/08/2016:15:03:20
8 3 3 2 07/10/2013:04:02:30
9 1 4 5 03/03/2018:02:03:20
In this output, everything is correct but only the date/time column. I want to print date only, and not time.
if key=="Date/Time":
    var="12/10/2011"
    print(var)
    var = data[key]
    iter_obj1 = iter(var)
    while True:
        try:
            element1 = next(iter_obj1)
            date =element1[0:10]
            print("date-",date)
        except StopIteration:
            break
I got it: I should not use any of that. Adding just one line inside the for loop gives the desired output:
if key=="Date/Time":
    data[key][i] = data[key][i][0:10]
That's it. Everything else stays the same.
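For reference, the whole transformation can be sketched as one small function, without pandas. It assumes the dict-of-lists layout built earlier in the question; the helper name `encode_columns` is made up, and a dictionary lookup stands in for the repeated `things.index(x)` call.

```python
def encode_columns(data, date_key="Date/Time"):
    """Number each value by order of first appearance; keep only the date part of date_key."""
    encoded = {}
    for key, values in data.items():
        if key == date_key:
            # The first 10 characters hold the date, e.g. "12/10/2011".
            encoded[key] = [value[:10] for value in values]
        else:
            ids = {}
            column = []
            for value in values:
                # setdefault assigns the next number the first time a value is seen.
                column.append(ids.setdefault(value, len(ids) + 1))
            encoded[key] = column
    return encoded

# A few rows from the sample input, already in dict-of-lists form.
data = {
    "Name": ["snehil singh", "salman", "Amir", "salman"],
    "Date/Time": ["12/10/2011:02:03:20", "12/11/2011:03:10:20",
                  "11/02/2009:09:03:20", "11/08/2016:15:03:20"],
}
result = encode_columns(data)
print(result["Name"])
print(result["Date/Time"])
```

The returned dictionary can be written out with the same `zip(*data.values())` step used in the question.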

manipulating two values of a key in dictionary at the same time

i am reading a file which is in the format below:
0.012281001 00:1c:c4:c2:1f:fe 1 30
0.012285001 00:1c:c4:c2:1f:fe 3 40
0.012288001 00:1c:c4:c2:1f:fe 2 50
0.012292001 00:1c:c4:c2:1f:fe 4 60
0.012295001 24:1c:c4:c2:2f:ce 5 70
I intend to make column 2 entities as keys and columns 3 and 4 as separate values. For each line I encounter, for that particular key, their respective values must add up (value 1 and value 2 should aggregate separately for that key). In the above example mentioned, I need to get the output like this:
'00:1c:c4:c2:1f:fe': 10 : 180, '24:1c:c4:c2:2f:ce': 5 : 70
The program i have written for simple 1 key 1 value is as below:
#!/usr/bin/python
import collections
result = collections.defaultdict(int)
clienthash = dict()
with open("luawrite", "r") as f:
    for line in f:
        hashes = line.split()
        ckey = hashes[1]
        val1 = float(hashes[2])
        result[ckey] += val1
print result
How can I extend this for 2 values and how can I print them as the output mentioned above. I am not getting any ideas. Please help! BTW i am using python2.6
You can store all of the values in a single dictionary, using a tuple as the stored value:
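import collections
# each key maps to a (sum of column 3, sum of column 4) tuple, starting at (0, 0)
result = collections.defaultdict(lambda: (0, 0))
with open("luawrite", "r") as f:
    for line in f:
        hashes = line.split()
        ckey = hashes[1]
        val1 = int(hashes[2])
        val2 = int(hashes[3])
        a, b = result[ckey]
        result[ckey] = (a + val1, b + val2)
print(result)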
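A self-contained sketch of the same idea, including the output format asked for in the question. It uses a two-element list as the default value instead of a tuple; the function name `aggregate` and the choice to read from any iterable of lines are illustrative, and the `print` call is written so that it should run on Python 2.6 as well as Python 3.

```python
import collections

def aggregate(lines):
    """Sum columns 3 and 4 separately for each key found in column 2."""
    result = collections.defaultdict(lambda: [0, 0])
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or short lines
        totals = result[fields[1]]
        totals[0] += int(fields[2])
        totals[1] += int(fields[3])
    return result

sample = [
    "0.012281001 00:1c:c4:c2:1f:fe 1 30",
    "0.012285001 00:1c:c4:c2:1f:fe 3 40",
    "0.012288001 00:1c:c4:c2:1f:fe 2 50",
    "0.012292001 00:1c:c4:c2:1f:fe 4 60",
    "0.012295001 24:1c:c4:c2:2f:ce 5 70",
]
result = aggregate(sample)
output = ", ".join("'%s': %d : %d" % (key, a, b)
                   for key, (a, b) in sorted(result.items()))
print(output)
```

To run it on the real file, pass the open file object instead of `sample`, for example `aggregate(open("luawrite"))`.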
