I have a text file containing an array of numbers from which I want to plot certain columns vs other columns. I defined a column function so I can assign a name to each column and then plot them, as in this sample code:
def column(matrix, i):
    return [float(row.split()[i]) for row in matrix]
Db = file('ResolutionEffects', 'r' )
HIcontour = column(Db,1)
Db.seek(1)
However when I display a column in my terminal to check that Python is indeed reading the right one, it appears that the first value of the column (as returned in my terminal) is actually the first value of the NEXT column in the text file. All the other numbers are from the correct column. There are no blank spaces or lines in the text file. As far as I can tell this offset happens to every column after the first one.
If anyone can tell why this is happening, or find a more robust way to read columns in text files I would greatly appreciate it.
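One possible culprit is the Db.seek(1) call: seek() moves the file position to a byte offset, not a line, so a subsequent read starts one character into the first line and the first row's tokens shift over, which matches the symptom described. A minimal sketch of a reader that avoids stale file positions entirely (the function name and filename are illustrative, not from the original code):

```python
def read_column(path, i):
    """Return column i of a whitespace-delimited text file as floats."""
    # re-opening the file on every call means no leftover seek offset
    # from an earlier read can shift the first row's tokens
    with open(path) as f:
        return [float(line.split()[i]) for line in f if line.strip()]
```

Re-opening with a context manager also guarantees the file is closed after each read.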
Indeed, I found loadtxt to be a lot more robust. After converting my text file to a data file (.dat) I simply use this:
import numpy as np
import matplotlib.pyplot as plt

a = np.loadtxt('ResolutionEffects.dat', usecols=(0, 1, 11, 12))
fig, ax1 = plt.subplots()
ax1.plot(a[:, 0], a[:, 1], 'dk', label='HI')
ax1.plot(a[:, 2], a[:, 3], 'dr', label='CO')
No weird offsets or bugs anymore :) Thanks Ajean and jedwards!
I'm saving my pd.DataFrame with
df.to_csv('df.csv', encoding='utf-8-sig')
My CSV file has a problem. Please see the rows containing content2-1, content2-2, and content2-3 in the attached picture.
Before saving (to_csv) there was no problem: all the data was in the right columns and 'content2' was not split apart. But after df -> csv, 'content2' is split across several rows, and the other values of 'id2' are allocated to the wrong columns.
"2018-04-21" should be in column D, the zeros in E, F, and G, and the URL in column I.
Why does this happen? Because of the large CSV file (774,740 KB)? Because of the language (Korean)? Or because CSV cannot handle the Enter key? (All the data with problems, such as content2, was split at a newline.)
How can I resolve this? I have no idea.
Unfortunately, I never figured out the reason for this. I assumed it was something to do with the size of the data I was working with and Excel not liking it.
What worked for me, though, was using .to_excel() instead of .to_csv(). I know it's far from a perfect answer, but I thought I'd put it here in case it's enough for your case.
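If the culprit is embedded newlines inside cell values (which Excel often misreads even when the CSV itself is valid), one possible workaround is to strip them before saving. A sketch with made-up data, not the original dataframe:

```python
import pandas as pd

df = pd.DataFrame({'id': ['id2'],
                   'content': ['content2-1\ncontent2-2\ncontent2-3']})
# replace embedded newlines with spaces so each record
# stays on one physical line in the output file
df = df.replace(r'\r?\n', ' ', regex=True)
df.to_csv('df.csv', index=False, encoding='utf-8-sig')
```

This keeps to_csv usable; whether it is acceptable depends on whether the newlines inside 'content' carry meaning.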
I'm trying to replace the empty cells of a column called 'City' with the most common value in the same column using the Python library pandas.
(Working with a CSV file here.)
This is what I've tried; assume the file is already read and ready to be edited:
location = df['City'].mode()
basicdf = "df['City'].replace('',"+location+", inplace=True)"
basicdf
So the logic here was to use .mode(), which gives the most frequent value in a column, assign that value to the variable 'location', and then insert that variable into the second line of code.
(I don't know how to do any of this the correct way at all.)
Building the command as a string seemed like the only way to insert whatever variable I desired into the .replace() call.
Edit: I have tried this code instead, but it ends up writing to other columns as well as 'City', which is not great.
df['City'].replace('',np.nan,inplace=True)
df = df.fillna(df['City'].value_counts().index[0])
Any tips would be appreciated: mainly how to achieve what I'm trying to do (without needing to restart from scratch, because I have a lot of other code in the file using pandas), and how to insert variables into these pandas calls (if that's even possible).
Found the answer, thanks mainly to Pygirl,
df['City'].replace('',np.nan,inplace=True)
df['City'].fillna(df['City'].value_counts().index[0], inplace=True)
these will first replace the blanks or empty cells with NaNs and then 'fill' in the NaNs with the most common value in the column selected, in this case: 'City'.
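An equivalent variant uses .mode(), which the question originally reached for; mode()[0] is the most frequent value of the column. A self-contained sketch with invented data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'City': ['Paris', '', 'Lyon', 'Paris', '']})
df['City'] = df['City'].replace('', np.nan)          # blanks -> NaN
df['City'] = df['City'].fillna(df['City'].mode()[0])  # fill with most common value
```

Operating on df['City'] alone (rather than calling fillna on the whole frame) is what keeps the other columns untouched.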
I am relatively new to Python and very new to NLP (and nltk) and I have searched the net for guidance but not finding a complete solution. Unfortunately the sparse code I have been playing with is on another network, but I am including an example spreadsheet. I would like to get suggested steps in plain English (more detailed than I have below) so I could first try to script it myself in Python 3. Unless it would simply be easier for you to just help with the scripting... in which case, thank you.
Problem: A few columns of an otherwise robust spreadsheet are very unstructured with anywhere from 500-5000 English characters that tell a story. I need to essentially make it a bit more structured by pulling out the quantifiable data. I need to:
1) Search for a string in the user supplied unstructured free text column (The user inputs the column header) (I think I am doing this right)
2) Make that string a NEW column header in Excel (I think I am doing this right)
3) Grab the number before the string (This is where I am getting stuck. And as you will see in the sheet, sometimes there is no space between the number and text and of course, sometimes there are misspellings)
4) Put that number in the NEW column on the same row (Have not gotten to this step yet)
I will have to do this repeatedly for multiple keywords but I can figure that part out, I believe, with a loop or something. Thank you very much for your time and expertise...
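For step 3, a regular expression can grab a number immediately before a keyword even when there is no space between them. A sketch (the function name and sample strings are invented, and misspelled keywords would still need separate fuzzy matching, e.g. with difflib):

```python
import re

def number_before(text, keyword):
    """Return the number immediately preceding keyword in text, or None."""
    # digits (with an optional decimal part), optional whitespace, then the keyword
    m = re.search(r'(\d+(?:\.\d+)?)\s*' + re.escape(keyword), text, re.IGNORECASE)
    return float(m.group(1)) if m else None
```

For example, number_before("walked 25feet unassisted", "feet") handles the missing-space case that breaks a plain split()-based approach.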
If I'm understanding this correctly, first we need to obtain the numbers from the string of text.
cell_val = sheet1wb1.cell(row=rowNum, column=4).value
This will create a list of every whitespace-separated integer in the string:
new_ = [int(s) for s in cell_val.split() if s.isdigit()]
print(new_)
You can use the list to assign the values to the column.
Then assign the first number in the list to the 5th column:
sheet1wb1.cell(row=rowNum, column=5).value = str(new_[0])
I think I have found what I am looking for. https://community.esri.com/thread/86096 has 3 or 4 scripts that seem to do the trick. Thank you..!
I am trying to run this code
num = df_out.drop_duplicates(subset=['Name', 'No.']).groupby(['Name']).size()
But when I do I get this error:
ValueError: not enough values to unpack (expected 2, got 0)
If we think of my dataframe (df_out) as an Excel file, I do have blank cells, but no full column or full row is blank. I need to skip the blank lines to run the code without changing the dataframe's structure.
Is this possible?
Thank you
Consider using df.dropna(). It is used to remove rows that contain NA. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html for more information.
First, you probably want your "blank cells" converted to NA values, so they can be dropped by dropna(). This can be done in various ways, notably df.replace(r'\s+', np.nan, regex=True) (with numpy imported as np; the old pandas.np alias has been removed from recent pandas). If your "blank cells" are all empty strings, or fixed strings equal to some value s, you can directly use df.replace('', np.nan) (first case) or df.replace(s, np.nan) (second case).
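Putting both steps together, a minimal sketch with invented data (the column names mirror the question; the anchored pattern ^\s*$ matches only all-whitespace cells, leaving interior spaces alone):

```python
import numpy as np
import pandas as pd

df_out = pd.DataFrame({'Name': ['A', ' ', 'B'],
                       'No.': [1.0, 2.0, np.nan]})
df_out = df_out.replace(r'^\s*$', np.nan, regex=True)  # blank cells -> NaN
df_out = df_out.dropna()                               # drop rows with any NaN
```

Note that dropna() returns a new frame, so the result must be assigned back (or kept separate if you want the original structure preserved elsewhere).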
My Python script parses some text out of an Excel file: it strips whitespace, changes the delimiters (from " : " to " , "), and outputs to a CSV file. Much of the data looks like this:
(what the data looks like in Excel)
Fields end up shifted by a single column due to there being an extra comma or two (CSV == comma-separated values, after all).
I have tried using if statements to add or subtract commas to try to shore it up, but that ends up completely messing up the relative order the data was first in. It's driving me nuts!
To try it another way, I installed the pandas library (a data-manipulation library) using pip.
Is it possible to merge columns that have no column headers inside a single DataFrame? There's plenty of advice on merging separate DataFrames, but not much for a single one.
Furthermore, how can I merge the columns while retaining the row position? The emails are in the correct row position but not the correct column position.
Or am I on the wrong track completely; is pandas overkill for a simple parsing script? I've been learning Python as I go along to try to complete the script, so I might have missed a simple way of doing it.
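One way to coalesce scattered, unnamed columns while keeping each row's position is to back-fill across the columns and take the first non-null value per row. A sketch under the assumption that each row holds at most one email (the data below is invented):

```python
import pandas as pd

# unnamed (integer-labelled) columns, with each email in a different column per row
df = pd.DataFrame([['a@company.com', None, None],
                   [None, 'b@company.com', None],
                   [None, None, 'c@company.com']])
emails = df.bfill(axis=1).iloc[:, 0]  # first non-null value in each row
```

bfill(axis=1) shifts later values leftward into the gaps, so column 0 ends up holding whichever value each row actually contains, in the original row order.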
Some sample data:
C5XXEmployeeNumXX,C5XXEmployeeNumXX,JohnSmith,1,,John,,Smith,,IT Supp.Centre,EU,,London1,,,59XXXX,ITServiceDesk,LOND01,,,,Notmaintained,,,,,,,,john.smith#company.com,
Snippet of parsing logic
for line in f:
    # finds the identifier for users
    if ':LON ' in line:
        # parsing logic: delimiters are swapped, whitespace is scrubbed
        line = line.replace(':', ',')
        line = line.replace(' ', '')
You can use a separator/delimiter of your choice. Check out https://docs.python.org/2/library/csv.html#csv.Dialect.delimiter.
Also, regarding the order: if you are reading each row into a list, the order should be fine, but if you are reading the contents of a row into a dict then it is normal that the order is not preserved (before Python 3.7, dicts do not guarantee insertion order).
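For example, the csv module can parse the colon-delimited source directly and write commas back out, keeping field order intact; a sketch with invented data:

```python
import csv
import io

raw = "C5 : JohnSmith : London1"
# parse with ':' as the delimiter; each row comes back as an ordered list
reader = csv.reader(io.StringIO(raw), delimiter=':')
rows = [[field.strip() for field in row] for row in reader]

# write the same rows out comma-separated
out = io.StringIO()
csv.writer(out).writerows(rows)
```

Letting csv.reader handle the splitting avoids the hand-rolled replace() calls, which is what was scrambling the commas in the first place.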