I have a dataframe read from a CSV file. I need to generate new data and append it after the existing rows.
But strangely, it produces a totally different result at small scale versus large scale. I suspect it relates to views, copy(), and chained assignment.
I tried two options using DataFrame.copy() to avoid potential problems.
First option:
d_jlist = pd.read_csv('127case.csv', sep=',')  # got the data shape (46355, 48) from the CSV file
d_jlist2 = d_jlist.copy()  # deep copy, in case the raw data gets changed
d_jlist3 = pd.DataFrame()
a = np.random.choice(range(5, 46350), size=1000*365)  # select from row 5 to row 46350
for i in a:
    d_jlist3 = d_jlist3.append(d_jlist.iloc[i].copy() + np.random.uniform(-1, 1))
d_jlist3 = d_jlist3.replace(0, 0.001, regex=True)
d_jlist3 = d_jlist3.round(3)
d_jlist = d_jlist.append(d_jlist3)
a = consumption.columns.values  # something to do with the header
a = a[5:53]
d_jlist.to_csv('1127case_1.csv', header=a, index=False)
Second option:
d_jlist = pd.read_csv('127case.csv', sep=',')
d_jlist2 = d_jlist.copy()
d_jlist3 = pd.DataFrame()
a = np.random.choice(range(5, 46350), size=1000*365)
for i in a:
    d_jlist3 = d_jlist3.append(d_jlist2.iloc[i] + np.random.uniform(-1, 1))
d_jlist3 = d_jlist3.replace(0, 0.001, regex=True)
d_jlist3 = d_jlist3.round(3)
d_jlist = d_jlist.append(d_jlist3)
a = consumption.columns.values  # something to do with the header
a = a[5:53]
d_jlist.to_csv('1117case_2.csv', header=a, index=False)
The problem is: at small scale, this code works as expected. New rows are added after the old ones, and nothing in the old data changes.
However, at the scale above (1000*365), the old rows get changed.
And the strange thing is: only the first two columns of each row stay unchanged. All the remaining columns of each row get changed.
The results:
The left one is the old dataframe, with shape (46356, 48). Below it are the newly generated data.
The right one is the result from option 1 (both options give the same result). From the third column on, the old data got changed.
If I try either option at a smaller scale (3 rows), it is fine: all the old data are kept.
d_jlist = pd.read_csv('127case.csv', sep=',')
d_jlist = d_jlist.iloc[:10]  # only keep 10 rows of the old data
d_jlist2 = d_jlist.copy()
d_jlist3 = pd.DataFrame()
a = np.random.choice(range(5, 6), size=3)  # only select 3 rows randomly from the old data
for i in a:
    d_jlist3 = d_jlist3.append(d_jlist2.iloc[i] + np.random.uniform(-1, 1))
d_jlist3 = d_jlist3.replace(0, 0.001, regex=True)
d_jlist3 = d_jlist3.round(3)
d_jlist = d_jlist.append(d_jlist3)
a = consumption.columns.values  # something to do with the header
a = a[5:53]
d_jlist.to_csv('1117case_2.csv', header=a, index=False)
How can I understand this? I have spent a lot of time trying to find an explanation for it but failed.
Do some rules in Pandas change when the scale gets larger (to the 365K level)?
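For reference, a minimal self-contained sketch of what I am trying to do (a synthetic frame stands in for 127case.csv; the shapes and column names here are made up for illustration). It collects the noisy copies in a list and concatenates once, then checks that the source frame is untouched:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(1, 10, size=(20, 4)), columns=list("abcd"))
original = df.copy()  # snapshot, to verify the old rows stay unchanged

idx = rng.choice(range(5, 15), size=30)  # rows to resample, with repetition
new_rows = [df.iloc[i] + rng.uniform(-1, 1) for i in idx]  # noisy copies
df3 = pd.DataFrame(new_rows).reset_index(drop=True)
df3 = df3.replace(0, 0.001).round(3)

result = pd.concat([df, df3], ignore_index=True)

assert df.equals(original)  # the old data are intact
print(result.shape)  # (50, 4)
```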
I'm not sure what happened, but my code worked before and now it won't. I have an Excel spreadsheet of projects I want to individually import and put into lists. However, I'm getting an "IndexError: index 8 is out of bounds for axis 0 with size 8" error, and Google searches have not resolved this for me. Any help is appreciated. I have the following fields in my Excel sheet: id, funding_end, keywords, pi, summaryurl, htmlabstract, abstract, project_num, title. Not sure what I'm missing...
import pandas as pd

dataset = pd.read_excel('new_ahrq_projects_current.xlsx', encoding="ISO-8859-1")
df = pd.DataFrame(dataset)
cols = [0, 1, 2, 3, 4, 5, 6, 7, 8]
df = df[df.columns[cols]]

# lists collecting each field
allenddates = []
allkeywords = []
allpis = []
allsummaryurls = []
allhtmlabstracts = []
allabstracts = []
allprojectnums = []
alltitles = []

tt = df['funding_end'] = df['funding_end'].astype(str)
tt = df.funding_end.tolist()
for t in tt:
    allenddates.append(t)
bb = df['keywords'] = df['keywords'].astype(str)
bb = df.keywords.tolist()
for b in bb:
    allkeywords.append(b)
uu = df['pi'] = df['pi'].astype(str)
uu = df.pi.tolist()
for u in uu:
    allpis.append(u)
vv = df['summaryurl'] = df['summaryurl'].astype(str)
vv = df.summaryurl.tolist()
for v in vv:
    allsummaryurls.append(v)
ww = df['htmlabstract'] = df['htmlabstract'].astype(str)
ww = df.htmlabstract.tolist()
for w in ww:
    allhtmlabstracts.append(w)
xx = df['abstract'] = df['abstract'].astype(str)
xx = df.abstract.tolist()
for x in xx:
    allabstracts.append(x)
yy = df['project_num'] = df['project_num'].astype(str)
yy = df.project_num.tolist()
for y in yy:
    allprojectnums.append(y)
zz = df['title'] = df['title'].astype(str)
zz = df.title.tolist()
for z in zz:
    alltitles.append(z)
"IndexError: index 8 is out of bounds for axis 0 with size 8"
cols = [0,1,2,3,4,5,6,7,8]
should be cols = [0,1,2,3,4,5,6,7].
I think you have 8 columns, but your cols list has 9 column indices.
IndexError: index out of bounds means you're trying to access something beyond its limit or range.
Whenever you load one of these files (a test.xls, test.csv, or test.xlsx file) using Pandas, such as:
data_set = pd.read_excel('file_example_XLS_10.xls', encoding="ISO-8859-1")
it is better to find the number of columns of the DataFrame first; that will help you move forward when working with large data sets. e.g.
import pandas as pd
data_set = pd.read_excel('file_example_XLS_10.xls', encoding="ISO-8859-1")
data_frames = pd.DataFrame(data_set)
print("Length of Columns:", len(data_frames.columns))
This will give you the exact number of columns of the Excel spreadsheet. Then you can specify the column indices accordingly:
Length of Columns: 8
cols = [0, 1, 2, 3, 4, 5, 6, 7]
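If you would rather not hard-code the indices at all, here is a small sketch (with a made-up stand-in frame instead of your .xlsx file) that derives cols from the frame's actual width, so it can never go out of bounds:

```python
import pandas as pd

# stand-in for the spreadsheet from the question (8 columns)
df = pd.DataFrame({"id": [1, 2], "funding_end": ["2020", "2021"],
                   "keywords": ["a", "b"], "pi": ["x", "y"],
                   "summaryurl": ["u", "v"], "htmlabstract": ["h", "i"],
                   "abstract": ["p", "q"], "project_num": ["n1", "n2"]})

cols = list(range(len(df.columns)))  # always matches the real width
df = df[df.columns[cols]]            # cannot raise IndexError
print(cols)  # [0, 1, 2, 3, 4, 5, 6, 7]
```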
I agree with @Bill CX that it sounds like you're trying to access a column that doesn't exist. Although I cannot reproduce your error, I have some ideas that may help you move forward.
First, double check the shape of your data frame:
import pandas as pd
dataset = pd.read_excel('new_ahrq_projects_current.xlsx',encoding="ISO-8859-1")
df = pd.DataFrame(dataset)
print(df.shape) # print shape of data read in to python
The output should be
(X, 9) # "X" is the number of rows
If the data frame has 8 columns, then df.shape will be (X, 8). This could be why you are getting the error.
Another check for you is to print out the first few rows of your data frame.
print(df.head())
This will let you double-check that you have read in the data in the correct form. I'm not sure, but it might be possible that your .xlsx file has 9 columns and pandas is reading in only 8 of them.
I need to find a duplicates in my txt file. The file looks like this:
3,3090,21,f,2,3
4,231,22,m,2,3
5,9427,13,f,2,2
6,9942,7,m,2,3
7,6802,33,f,3,2
8,8579,11,f,2,4
9,8598,11,f,2,4
10,16729,23,m,1,1
11,8472,11,f,3,4
12,10976,21,f,3,3
13,2870,21,f,2,3
14,12032,10,f,3,4
15,16999,13,m,2,2
16,570,7,f,2,3
17,8485,11,f,2,4
18,8728,11,f,3,4
19,20861,9,f,2,2
20,19771,34,f,2,2
21,17964,10,f,2,2
There are ~30,000 lines like this. I need to find duplicates in the second column and save the rows to new files without any duplicates. My code is:
def dedupe(data):
    d = []
    for l in lines:
        if l[0] in d:
            d[l[0]] += l[:1]
        else:
            d[l[0]] = l[1]
    return d
#m - male
#f - female
data = open('plec.txt', 'r')
save_m = open('plec_m.txt', 'w')
save_f = open('plec_f.txt', 'w')
lines = data.readlines()[1:]
for line in lines:
    gender = line.strip().split(',')[3]
    if gender is 'f':
        dedupe(line)
        save_f.write(line)
    elif gender is 'm':
        dedupe(line)
        save_m.write(line)
But I'm getting this error:
Traceback (most recent call last):
  File "plec.py", line 88, in <module>
    dedupe(line)
  File "plec.py", line 75, in dedupe
    d[l[0]] = l[1]
TypeError: list indices must be integers, not str
EDIT 2018-10-28:
I don't remember exactly what I had to deduplicate in this file; I think the 2nd and 4th columns had to be unique, but I'm not sure now. I did find the wrong part in my code, though, and because of it I rebuilt all of the code; the version below works.
def dedup(my_list, new_file):
    d = list()
    for single_line in my_list:
        if single_line.split(',')[1] not in [i.split(',')[1] for i in d]:
            d.append(single_line)
    print(len(my_list), len(d))
    new_file.writelines(d)

data = open('plec.txt', 'r').readlines()[1:]  # skip the header line
males = open('m.txt', 'w')
females = open('f.txt', 'w')
males_list = list()
females_list = list()
for line in data:
    gender = line.split(',')[3]
    if gender == 'm':
        males_list.append(line)
    if gender == 'f':
        females_list.append(line)
dedup(males_list, males)
dedup(females_list, females)
You can use Pandas to read your input file and remove the duplicates based on any column you want.
from io import StringIO  # on Python 2, use: from StringIO import StringIO
import pandas as pd

data = StringIO("""col1,col2,col3,col4,col5,col6
3,3090,21,f,2,3
4,231,22,m,2,3
5,9427,13,f,2,2
6,9942,7,m,2,3
7,6802,33,f,3,2
8,8579,11,f,2,4
9,8598,11,f,2,4
10,16729,23,m,1,1
11,8472,11,f,3,4
12,10976,21,f,3,3
13,2870,21,f,2,3
14,12032,10,f,3,4
15,16999,13,m,2,2
16,570,7,f,2,3
17,8485,11,f,2,4
18,8728,11,f,3,4
19,20861,9,f,2,2
20,19771,34,f,2,2
21,17964,10,f,2,2""")
df = pd.read_csv(data, sep=",")
df = df.drop_duplicates(subset='col2')  # note: reassign, drop_duplicates does not modify in place
df.to_csv("no_dups.txt", index=False)
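As a quick self-contained check of what drop_duplicates(subset='col2') keeps (toy rows with the column names assumed above; the default keep='first' retains the earliest row for each col2 value):

```python
import pandas as pd

df = pd.DataFrame({"col1": [8, 9, 17], "col2": [8579, 8598, 8485],
                   "col4": ["f", "f", "f"]})
dup = pd.DataFrame({"col1": [99], "col2": [8579], "col4": ["m"]})
both = pd.concat([df, dup], ignore_index=True)

# the later row with col2 == 8579 is dropped; the first is kept
deduped = both.drop_duplicates(subset="col2", keep="first")
print(len(both), len(deduped))  # 4 3
```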
seen = set()
for row in my_filehandle:
    my_2nd_col = row.split(",")[1]
    if my_2nd_col in seen:
        continue
    output_filehandle.write(row)
    seen.add(my_2nd_col)
is one very verbose way of doing this
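Filled out into something runnable (sample rows inlined here in place of real file handles, so there is nothing to open on disk):

```python
import io

sample = ("3,3090,21,f,2,3\n"
          "8,8579,11,f,2,4\n"
          "9,8598,11,f,2,4\n"
          "17,8485,11,f,2,4\n"
          "99,8579,33,m,1,1\n")  # last row duplicates col2 == 8579

seen = set()
kept = []
for row in io.StringIO(sample):   # stands in for the input file handle
    key = row.split(",")[1]       # second column
    if key in seen:
        continue
    kept.append(row)              # stands in for output_filehandle.write(row)
    seen.add(key)

print(len(kept))  # 4
```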
OP, I don't know what's wrong with your code, but this solution should fit your requirements, assuming your requirements are:
Filter the file based on the second column
Store male and female entries in separate files
Here's the code:
with open('plec.txt') as file:
    lines = map(lambda line: line.split(','), file.read().split('\n'))  # split the file into lines and the lines by comma

filtered_lines_male = []
filtered_lines_female = []
second_column_set = set()
for line in lines:
    if line[1] not in second_column_set:
        second_column_set.add(line[1])  # add to index set
        if line[3] == 'm':
            filtered_lines_male.append(line)  # add to male list
        else:
            filtered_lines_female.append(line)  # add to female list

filtered_lines_male = '\n'.join([','.join(line) for line in filtered_lines_male])  # apply source formatting
filtered_lines_female = '\n'.join([','.join(line) for line in filtered_lines_female])  # apply source formatting

with open('plec_m.txt', 'w') as male_write_file:
    male_write_file.write(filtered_lines_male)  # write male entries

with open('plec_f.txt', 'w') as female_write_file:
    female_write_file.write(filtered_lines_female)  # write female entries
Please use better variable names the next time you write code, and please make your questions more specific.