I have data in a file, and I don't know whether it is delimited by spaces or tabs.
Data In:
id Name year Age Score
123456 ALEX BROWNNIS VND 0 19 115
123457 MARIA BROWNNIS VND 0 57 170
123458 jORDAN BROWNNIS VND 0 27 191
I read the data with read_csv, using tab as the delimiter:
df = pd.read_csv('data.txt', sep='\t')
out:
id Name year Age Score
0 123456 ALEX BROWNNIS VND ... 0 19 115
1 123457 MARIA BROWNNIS VND ... 0 57 170
2 123458 jORDAN BROWNNIS VND ... 0 27 191
There is a lot of whitespace between the columns. Am I using the delimiter correctly? When I try to access a column by name I get a KeyError, so I think the fault is the use of \t.
What are the possible ways to fix this problem?
Since the Name column contains a variable number of words, you need to read the file as regular text and then join everything between the first word and the last three words.
import pandas as pd

ids = []
Name = []
year = []
Age = []
Score = []
with open('data.txt') as f:
    text = f.read()
lines = text.split('\n')
for line in lines[1:]:  # skip the header row
    if len(line) < 3: continue
    words = line.split()
    ids.append(words[0])
    # everything between the id and the last three fields is the name
    Name.append(' '.join(words[1:-3]))
    year.append(words[-3])
    Age.append(words[-2])
    Score.append(words[-1])
df = pd.DataFrame.from_dict({'id': ids, 'Name': Name,
                             'year': year, 'Age': Age, 'Score': Score})
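As a variation on the loop above, here is a minimal sketch that peels the three numeric fields off the right with str.rsplit and the id off the left with str.split. The sample text and the parse_line helper are made up for illustration, and it assumes that every row ends with exactly three single-word fields.

```python
import pandas as pd

# Sample rows copied from the question; a real script would read data.txt instead.
sample = """id Name year Age Score
123456 ALEX BROWNNIS VND 0 19 115
123457 MARIA BROWNNIS VND 0 57 170
123458 jORDAN BROWNNIS VND 0 27 191
"""

def parse_line(line):
    """Split one row into its five fields (hypothetical helper)."""
    # Take the last three whitespace-separated fields from the right...
    head, year, age, score = line.rsplit(None, 3)
    # ...then the id from the left; whatever remains is the multi-word name.
    id_, name = head.split(None, 1)
    return {'id': id_, 'Name': name, 'year': int(year),
            'Age': int(age), 'Score': int(score)}

# Skip the header row and any blank lines.
rows = [parse_line(line) for line in sample.splitlines()[1:] if line.strip()]
df = pd.DataFrame(rows)
print(df)
```

Converting the numeric fields to int while parsing is a design choice of this sketch; drop the int() calls to keep them as strings.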
Edit: since you've posted the full data, I've changed my answer to fit it.
You can use the skipinitialspace parameter, as in the following example. Note that sep and delimiter are aliases for the same option, so pass only one of them.
df2 = pd.read_csv('data.txt', sep='\t', encoding="utf-8", skipinitialspace=True)
Pandas documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Problem Solved:
df = pd.read_csv('data.txt', sep='\t',engine="python")
I added this line of code to strip the whitespace around the column names, and it works:
df.columns = df.columns.str.strip()
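To illustrate why the KeyError appears, here is a small sketch with made-up, tab-separated data whose header cells are padded with spaces. The sample text is an assumption about what the real file looks like, not a copy of it.

```python
import io
import pandas as pd

# Hypothetical sample: tab-separated, with stray spaces around the header names.
raw = "id \t Name \t Score\n123456\tALEX BROWNNIS VND\t115\n"

df = pd.read_csv(io.StringIO(raw), sep="\t")
padded = list(df.columns)                # the names still carry the spaces
has_name_before = "Name" in df.columns   # so df["Name"] would raise KeyError

# Strip the whitespace so the names match what you type.
df.columns = df.columns.str.strip()
print(padded, "->", list(df.columns))
```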
I'm using Vote History data from the Secretary of State; however, the .txt file they gave me has 7 million rows, where each row is a string of 27 characters. The first 3 characters are a code for the county. The next 8 characters are the registration ID, the next 8 characters are the date voted, etc. I can't do text to columns in Excel because the file is too big. Is there a way to separate this file into columns with pandas in Python?
Example
Currently I have:
0010000413707312012026R
0010000413708212012027R
0010000413711062012029
0010004535307312012026D
I want to have columns:
001 00004137 07312012 026 R
001 00004137 08212012 027 R
001 00004137 11062012 029
001 00045353 07312012 026 D
Where each space separates a new column. Any suggestions? Thanks.
Simplest I can make it:
import pandas as pd
sample_lines = ['0010000413707312012026R','0010000413708212012027R','0010000413711062012029','0010004535307312012026D']
COLUMN_NAMES = ['A','B','C','D','E']
df = pd.DataFrame(columns=COLUMN_NAMES)
for line in sample_lines:
    # slice each fixed-width field out of the line
    row = [line[0:3], line[3:11], line[11:19], line[19:22], line[22:23]]
    df.loc[len(df)] = row
print(df)
Outputs:
A B C D E
0 001 00004137 07312012 026 R
1 001 00004137 08212012 027 R
2 001 00004137 11062012 029
3 001 00045353 07312012 026 D
Try this. I don't think you have a problem reading from the txt file, so here is a simplified case:
import pandas as pd

a=['0010000413707312012026R','0010000413708212012027R','0010000413711062012029','0010004535307312012026D']
area=[]
date=[]
e1=[]
e2=[]
e3=[]
#001 00004137 07312012 026 R
for i in range(0, len(a)):
    area.append(a[i][0:3])
    date.append(a[i][3:11])
    e1.append(a[i][11:19])
    e2.append(a[i][19:22])
    e3.append(a[i][22:23])
all_list = pd.DataFrame(
    {'area': area,
     'date': date,
     'e1': e1,
     'e2': e2,
     'e3': e3
    })
print(all_list)
#save as CSV file
all_list.to_csv('all.csv')
Since the file is too big, it's better to read it line by line and write the result to a different file, instead of reading the entire file into memory:
# open the output once instead of reopening it for every line
with open('temp.csv') as f, open('modified.csv', 'w') as f2:
    for line in f:
        line = line.rstrip('\n')
        code = line[0:3]
        registration = line[3:11]
        date = line[11:19]
        second_code = line[19:22]
        letter = line[22:]
        f2.write(
            ' '.join([code, registration, date, second_code, letter]) + '\n')
You can also read the content of the txt file and use str.extract to divide it into DataFrame columns (dtype=str makes sure the leading zeros are kept):
df = pd.read_csv('temp.csv', header=None, dtype=str)
df
# 0
# 0 0010000413707312012026R
# 1 0010000413708212012027R
# 2 0010000413711062012029
# 3 0010004535307312012026D
df = df[df.columns[0]].str.extract('(.{3})(.{8})(.{8})(.{3})(.*)')
df
# 0 1 2 3 4
# 0 001 00004137 07312012 026 R
# 1 001 00004137 08212012 027 R
# 2 001 00004137 11062012 029
# 3 001 00045353 07312012 026 D
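Another option worth trying is pandas.read_fwf, which is meant for fixed-width files. The sketch below uses the four sample rows from the question; the column names are hypothetical, and the widths are taken from the desired output rather than from the real 27-character layout.

```python
import io
import pandas as pd

# The four sample rows from the question; the real data would come from the .txt file.
raw = (
    "0010000413707312012026R\n"
    "0010000413708212012027R\n"
    "0010000413711062012029\n"
    "0010004535307312012026D\n"
)

widths = [3, 8, 8, 3, 1]                                      # field widths, left to right
names = ['county', 'reg_id', 'date_voted', 'code', 'letter']  # made-up column names

# dtype=str keeps the leading zeros; a missing trailing letter becomes NaN.
df = pd.read_fwf(io.StringIO(raw), widths=widths, names=names,
                 header=None, dtype=str)
print(df)
```

For the 7-million-row file, read_fwf also accepts a chunksize argument, so the data can be processed in pieces instead of all at once.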
I have a file with data similar to this:
[START]
Name = Peter
Sex = Male
Age = 34
Income[2020] = 40000
Income[2019] = 38500
[END]
[START]
Name = Maria
Sex = Female
Age = 28
Income[2020] = 43000
Income[2019] = 42500
Income[2018] = 40000
[END]
[START]
Name = Jane
Sex = Female
Age = 41
Income[2020] = 60500
Income[2019] = 57500
Income[2018] = 54000
[END]
I want to read this data into a pandas dataframe so that at the end it is similar to this
Name Sex Age Income[2020] Income[2019] Income[2018]
Peter Male 34 40000 38500 NaN
Maria Female 28 43000 42500 40000
Jane Female 41 60500 57500 54000
So far, I wasn't able to figure out if this is a standard data file format (it has some similarities to JSON but is still very different).
Is there an elegant and fast way to read this data to a dataframe?
I don't know about elegant, but there is an easy way. Python is very good at parsing simply formatted text.
Here, [START] starts a new record, [END] ends it, and inside a record, you have key = value lines. You can easily build a custom parser to generate a list of records to feed into a pandas DataFrame:
import pandas as pd

inblock = False
fieldnames = []
data = []
with open(filename) as f:
    for line in f:
        if inblock:
            if line.strip() == '[END]':
                inblock = False
            elif '=' in line:
                # split on the first '=' only, so values may contain '='
                k, v = (i.strip() for i in line.split('=', 1))
                record[k] = v
                if k not in fieldnames:
                    fieldnames.append(k)
        else:
            if line.strip() == '[START]':
                inblock = True
                record = {}
                data.append(record)
df = pd.DataFrame(data, columns=fieldnames)
df is as expected:
Name Sex Age Income[2020] Income[2019] Income[2018]
0 Peter Male 34 40000 38500 NaN
1 Maria Female 28 43000 42500 40000
2 Jane Female 41 60500 57500 54000
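Note that the parser keeps every value as a string. Here is a hedged sketch of the same idea wrapped in a function, followed by a numeric conversion. The parse_records name, the shortened sample text, and the rule for picking numeric columns (Age plus anything starting with Income) are assumptions made for illustration.

```python
import pandas as pd

def parse_records(lines):
    """Collect key = value pairs between [START] and [END] markers (hypothetical helper)."""
    records, current = [], None
    for line in lines:
        line = line.strip()
        if line == '[START]':
            current = {}
        elif line == '[END]':
            if current is not None:
                records.append(current)
            current = None
        elif current is not None and '=' in line:
            key, value = (part.strip() for part in line.split('=', 1))
            current[key] = value
    return pd.DataFrame(records)

sample = """[START]
Name = Peter
Age = 34
Income[2020] = 40000
[END]
[START]
Name = Maria
Age = 28
Income[2020] = 43000
Income[2018] = 40000
[END]
"""

df = parse_records(sample.splitlines())
# Convert the columns assumed to be numeric; missing values stay NaN.
numeric_cols = [c for c in df.columns if c == 'Age' or c.startswith('Income')]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)
print(df)
```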
file1.txt contains usernames, i.e.
tony
peter
john
...
file2.txt contains user details, just one line for each user details, i.e.
alice 20160102 1101 abc
john 20120212 1110 zjc9
mary 20140405 0100 few3
peter 20140405 0001 io90
tango 19090114 0011 n4-8
tony 20150405 1001 ewdf
zoe 20000211 0111 jn09
...
I want to get a shortlist of user details from file2.txt for the users listed in file1.txt, i.e.
john 20120212 1110 zjc9
peter 20140405 0001 io90
tony 20150405 1001 ewdf
How can I do this in Python?
You can use .split(' '), assuming that there will always be a space between the name and the other information in file2.txt.
Here's an example:
UserList = []
with open("file1.txt","r") as fuser:
    UserLine = fuser.readline()
    while UserLine!='':
        UserList.append(UserLine.split("\n")[0]) # Strip the trailing newline from the user name.
        UserLine = fuser.readline()
InfoUserList = []
InfoList = []
with open("file2.txt","r") as finfo:
    InfoLine = finfo.readline()
    while InfoLine!='':
        InfoList.append(InfoLine)
        line1 = InfoLine.split(' ')
        InfoUserList.append(line1[0]) # Take just the user name to compare it later
        InfoLine = finfo.readline()
for user in UserList:
    for i in range(len(InfoUserList)):
        if user == InfoUserList[i]:
            print(InfoList[i], end='') # the stored line already ends with a newline
You can use pandas:
import pandas as pd
file1 = pd.read_csv('file1.txt', sep =' ', header=None)
file2 = pd.read_csv('file2.txt', sep=' ', header=None)
shortlist = file2.loc[file2[0].isin(file1.values.T[0])]
it will give you the following result:
0 1 2 3
1 john 20120212 1110 zjc9
3 peter 20140405 1 io90
5 tony 20150405 1001 ewdf
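The above is a DataFrame; to convert it back to an array, just use shortlist.values. Note that the third column was parsed as integers, so 0001 became 1. Passing dtype=str to read_csv keeps the leading zeros.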
import pandas as pd
df1 = pd.read_csv('df1.txt', header=None)
df2 = pd.read_csv('df2.txt', header=None)
df1[0] = df1[0].str.strip() # remove the 2 whitespace characters that follow the field
df2 = df2[0].str[0:-2].str.split(' ').apply(pd.Series) # drop the 2 trailing whitespace characters, then split into columns
df = df1.merge(df2)
Out[26]:
0 1 2 3
0 tony 20150405 1001 ewdf
1 peter 20140405 0001 io90
2 john 20120212 1110 zjc9
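For comparison, the same lookup can be sketched in plain Python with a set, which avoids the nested loop and keeps the details as text, so leading zeros survive. The shortlist function name and the in-memory sample lists are made up for illustration; it assumes the user name is the first whitespace-separated field of each detail line.

```python
def shortlist(usernames, detail_lines):
    """Return the detail lines whose first field is one of the usernames."""
    # A set gives constant-time membership tests.
    wanted = {name.strip() for name in usernames if name.strip()}
    selected = []
    for line in detail_lines:
        fields = line.split()
        if fields and fields[0] in wanted:
            selected.append(line.rstrip('\n'))
    return selected

# Sample data from the question, as the lines would look when read from the files.
users = ['tony\n', 'peter\n', 'john\n']
details = [
    'alice 20160102 1101 abc\n',
    'john 20120212 1110 zjc9\n',
    'mary 20140405 0100 few3\n',
    'peter 20140405 0001 io90\n',
    'tango 19090114 0011 n4-8\n',
    'tony 20150405 1001 ewdf\n',
    'zoe 20000211 0111 jn09\n',
]

result = shortlist(users, details)
for line in result:
    print(line)
```

With real files, open file objects can be passed directly, since the function only iterates over lines.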
I have a csv file with data like this:
Name Value Value2 Value3 Rating
ddf 34 45 46 ok
ddf 67 23 11 ok
ghd 23 11 78 bad
ghd 56 33 78 bad
.....
What I want to do is loop through my csv and add together the rows that have the same name. The string at the end of each row will always remain the same for that name, so there is no fear of it changing. How would I go about changing it to this in Python?
Name Value Value2 Value3 Rating
ddf 101 68 57 ok
ghd 79 44 156 bad
EDIT:
In my code, the first thing I did was sort the list into order so the same names would be near each other, then I tried to use a for loop to add the numbered lines together by checking if the name value is the same on the first column. It's a very ugly way of doing it and I am at my wits end.
import csv
import operator

sortedList = csv.reader(open("keywordReport.csv"))
editedFile = open("output.csv",'w')
wr = csv.writer(editedFile, delimiter = ',')
name = ""
sortedList = sorted(sortedList, key=operator.itemgetter(0), reverse=True)
newKeyword = ["","","","","",""]
for row in sortedList:
    if row[0] != name:
        wr.writerow(newKeyword)
        name = row[0]
    else:
        newKeyword[0] = row[0] #Name
        newKeyword[1] = str(float(newKeyword[1]) + float(row[1]))
        newKeyword[2] = str(float(newKeyword[2]) + float(row[2]))
        newKeyword[3] = str(float(newKeyword[3]) + float(row[3]))
The pandas way is very simple:
import pandas as pd
aframe = pd.read_csv('thefile.csv')
Out[19]:
Name Value Value2 Value3 Rating
0 ddf 34 45 46 ok
1 ddf 67 23 11 ok
2 ghd 23 11 78 bad
3 ghd 56 33 78 bad
r = aframe.groupby(['Name','Rating'],as_index=False).sum()
Out[40]:
Name Rating Value Value2 Value3
0 ddf ok 101 68 57
1 ghd bad 79 44 156
If you need to do further analysis and statistics, pandas will take you a long way with little effort. For the use case here it is like using a hammer to kill a fly, but I wanted to provide this alternative.
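If you would rather keep the original column order, with Rating last, one possible variation is to group by Name only and choose an aggregation per column. This is a sketch on the sample data, read here from an in-memory string; it relies on the question's statement that Rating never changes within a name.

```python
import io
import pandas as pd

# Sample data from the question, in comma-separated form.
csv_text = """Name,Value,Value2,Value3,Rating
ddf,34,45,46,ok
ddf,67,23,11,ok
ghd,23,11,78,bad
ghd,56,33,78,bad
"""

aframe = pd.read_csv(io.StringIO(csv_text))

# Sum the numeric columns and keep the first Rating seen for each name.
summed = aframe.groupby('Name', as_index=False).agg(
    {'Value': 'sum', 'Value2': 'sum', 'Value3': 'sum', 'Rating': 'first'})
print(summed)
```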
file.csv
Name,Value,Value2,Value3,Rating
ddf,34,45,46,ok
ddf,67,23,11,ok
ghd,23,11,78,bad
ghd,56,33,78,bad
code
import csv
def map_csv_rows(f):
    c = [x for x in csv.reader(f)]
    # pair each row with the header and convert digit strings to int
    return [dict(zip(c[0], map(lambda p: int(p) if p.isdigit() else p, x))) for x in c[1:]]
# in Python 3 the csv module expects text files opened with newline=''
with open('file.csv', newline='') as f:
    my_csv = map_csv_rows(f)
output = {}
for row in my_csv:
    output.setdefault(row.get('Name'), {'Name': row.get('Name'), 'Value': 0,'Value2': 0, 'Value3': 0, 'Rating': row.get('Rating')})
    for val in ['Value', 'Value2', 'Value3']:
        output[row.get('Name')][val] = output[row.get('Name')][val] + row.get(val)
with open('out.csv', 'w', newline='') as f:
    fieldnames = ['Name', 'Value', 'Value2', 'Value3', 'Rating']
    writer = csv.DictWriter(f, fieldnames = fieldnames)
    writer.writeheader()
    for out in output.values():
        writer.writerow(out)
For comparison purposes, an equivalent awk program:
$ awk -v OFS="\t" '
NR==1{$1=$1;print;next}
{k=$1;a[k]+=$2;b[k]+=$3;c[k]+=$4;d[k]=$5}
END{for(i in a) print i,a[i],b[i],c[i],d[i]}' input
will print
Name Value Value2 Value3 Rating
ddf 101 68 57 ok
ghd 79 44 156 bad
If the input is csv and you want csv output, add the -F, argument and change OFS to OFS=,
I am trying to write a simple Python function which will read in a CSV file and find the average for some columns and rows.
The function will examine the first row and for each column whose header
starts with the letter 'Q' it will calculate the average of values in
that column and then print it to the screen. Then for each row of the
data it will calculate the students average for all items in columns
that start with 'Q'. It will calculate this average normally and also
with the lowest quiz dropped. It will print out two values per student.
the CSV file contains grades for students and looks like this:
hw1 hw2 Quiz3 hw4 Quiz2 Quiz1
john 87 98 76 67 90 56
marie 45 67 65 98 78 67
paul 54 64 93 28 83 98
fred 67 87 45 98 56 87
the code I have so far is this but I have no idea how to continue:
import csv
def practice():
    newlist=[]
    afile= input('enter file name')
    a = open(afile, 'r')
    reader = csv.reader(a, delimiter = ",")
    for each in reader:
        newlist.append(each)
    y=sum(int(x[2] for x in reader))
    print (y)
    filtered = []
    total = 0
    for i in range (0,len(newlist)):
        if 'Q' in [i][1]:
            filtered.append(newlist[i])
    return filtered
May I suggest the use of Pandas:
>>> import pandas as pd
>>> data = pd.read_csv('file.csv', sep=r'\s+')
>>> q_columns = [name for name in data.columns if name.startswith('Q')]
>>> reduced_data = data[q_columns].copy()
>>> reduced_data.mean()
Quiz3 69.75
Quiz2 76.75
Quiz1 77.00
dtype: float64
>>> reduced_data.mean(axis=1)
john 74.000000
marie 70.000000
paul 91.333333
fred 62.666667
dtype: float64
>>> import numpy as np
>>> for index, column in reduced_data.idxmin(axis=1).items():
...     reduced_data.loc[index, column] = np.nan
>>> reduced_data.mean(axis=1)
john 83.0
marie 72.5
paul 95.5
fred 71.5
dtype: float64
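The drop-the-lowest average can also be computed without a loop, by subtracting each row's minimum from its sum. The sketch below is one way to do that on the quiz columns of the sample data; it assumes every student has at least two quiz scores and none are missing.

```python
import io
import pandas as pd

# Quiz columns from the sample data, in comma-separated form.
text = """name,Quiz3,Quiz2,Quiz1
john,76,90,56
marie,65,78,67
paul,93,83,98
fred,45,56,87
"""

quizzes = pd.read_csv(io.StringIO(text), index_col='name')

# Row total minus the lowest score, divided by the number of remaining quizzes.
dropped = (quizzes.sum(axis=1) - quizzes.min(axis=1)) / (quizzes.shape[1] - 1)
print(dropped)
```

Unlike the loop above, this leaves the scores untouched, so the normal averages can still be computed from the same frame afterwards.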
Your code would be nicer if you changed your .csv format. Then we can use DictReader easily.
grades.csv:
name,hw1,hw2,Quiz3,hw4,Quiz2,Quiz1
john,87,98,76,67,90,56
marie,45,67,65,98,78,67
paul,54,64,93,28,83,98
fred,67,87,45,98,56,87
Code:
import numpy as np
from collections import defaultdict
import csv
result = defaultdict( list )
with open('grades.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        for k in row:
            if k.startswith('Q'):
                result[ row['name'] ].append( int(row[k]) )
for name, lst in result.items():
    # sorted(lst)[1:] drops the lowest quiz score before averaging
    print(name, np.mean( sorted(lst)[1:] ))
Output:
john 83.0
marie 72.5
paul 95.5
fred 71.5