Reading a variable whitespace-delimited table in Python

Right now I am trying to read a table that has a variable-whitespace delimiter and also contains missing/blank values. I would like to read the table in Python and produce a CSV file. I have tried the NumPy, pandas and csv libraries, but the combination of variable spacing and missing data has made it nearly impossible for me to read the table. The file I am trying to read is attached here:
goo.gl/z7S2Mo
I would really appreciate it if anyone could help me with a solution in Python.

You need your delimiter to be two spaces or more (instead of one space or more). Here's a solution:
import pandas as pd
# r'\s{2,}' treats runs of two or more whitespace characters as the delimiter,
# so single spaces inside values (e.g. company names) are preserved
df = pd.read_csv('infotable.txt', sep=r'\s{2,}', header=None, engine='python', thousands=',')
Result:
>>> print(df.head())
0 1 2 3 4 5 \
0 ISHARES MORNINGSTAR MID GROWTH ETP 464288307 3892 41700 SH
1 ISHARES S&P MIDCAP 400 GROWTH ETP 464287606 4700 47600 SH
2 BED BATH & BEYOND Common Stock 075896100 870 15000 SH
3 CARBO CERAMICS INC Common Stock 140781105 950 7700 SH
4 CATALYST HEALTH SOLUTIONS IN Common Stock 14888B103 1313 25250 SH
6 7 8 9
0 Sole 41700 0 0
1 Sole 47600 0 0
2 Sole 15000 0 0
3 Sole 7700 0 0
4 Sole 25250 0 0
>>> print(df.dtypes)
0 object
1 object
2 object
3 int64
4 int64
5 object
6 object
7 int64
8 int64
9 int64
dtype: object
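Since the goal is a CSV file, the parsed DataFrame can be written straight back out; a minimal sketch (the output filename is just an example):
# no index or header in the output, since the source table had neither
df.to_csv('infotable.csv', index=False, header=False)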

The numpy module has a function to do just that (see last line):
import numpy as np
path = "<insert file path here>/infotable.txt"
# read off column locations from a text editor.
# I used Notepad++ to do that.
column_locations = np.array([1, 38, 52, 61, 70, 78, 98, 111, 120, 127, 132])
# My text editor starts counting at 1, while numpy starts at 0. Fixing that:
column_locations = column_locations - 1
# Get column widths
widths = column_locations[1:] - column_locations[:-1]
data = np.genfromtxt(path, dtype=None, delimiter=widths, autostrip=True)
Depending on your exact use case, you may use a different method to get the column widths but you get the idea. dtype=None ensures that numpy determines the data types for you; this is very different from leaving out the dtype argument. Finally, autostrip=True strips leading and trailing whitespace.
The output (data) is a structured array.
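If you still need a CSV at the end, pandas accepts a structured array directly; a small sketch, assuming the fixed-width parse above succeeded (pandas uses the auto-generated field names f0, f1, ... as column labels):
import pandas as pd
# build a DataFrame from the structured array, then write it out
df = pd.DataFrame(data)
df.to_csv('infotable.csv', index=False, header=False)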

How Solve a Data Science Question Using Python's Panda Data Structure Syntax

Good afternoon.
I have a question I am trying to solve using pandas data structures and related syntax from the Python scripting language. I have already graduated from a US university and am employed, and I am currently taking the "Python for Data Science" course, offered online on Coursera's platform by the University of Michigan, for professional development. I am not sharing answers with anyone, as I abide by Coursera's Honor Code.
First, I was given this panda dataframe chart concerning Olympic medals won by countries around the world:
# Summer Gold Silver Bronze Total # Winter Gold.1 Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 Combined total ID
Afghanistan 13 0 0 2 2 0 0 0 0 0 13 0 0 2 2 AFG
Algeria 12 5 2 8 15 3 0 0 0 0 15 5 2 8 15 ALG
Argentina 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70 ARG
Armenia 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12 ARM
Australasia 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12 ANZ
Second, the question asked is, "Which country has won the most gold medals in summer games?"
Third, a hint given to me on how to answer using pandas syntax is this:
"This function should return a single string value."
Fourth, I tried entering this answer in pandas syntax:
import pandas as pd
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
def answer_one():
    if df.columns[:2]=='00':
        df.rename(columns={col:'Country'+col[4:]}, inplace=True)
    df_max = df[df[max('Gold')]]
    return df_max['Country']

answer_one()
Fifth, I have tried various other answers like this in Coursera's auto-grader, but
it keeps giving this error message:
There was a problem evaluating function answer_one: it threw an exception and was thus counted as incorrect.
0.125 points were not awarded.
Could you please help me solve that question? Any hints/suggestions/comments are welcome for that.
Thanks, Kevin
You can use pandas' loc function to find the country name corresponding to the maximum of the "Gold" column:
import pandas as pd

data = [('Afghanistan', 13),
        ('Algeria', 12),
        ('Argentina', 23)]
df = pd.DataFrame(data, columns=['Country', 'Gold'])
df['Country'].loc[df['Gold'] == df['Gold'].max()]
The last line returns a Series whose only value is Argentina.
Edit 1:
I just noticed you import the .csv file using pd.read_csv('olympics.csv', index_col=0, skiprows=1). If you leave out the skiprows argument, the first line of the .csv file is used for the column names of the dataframe. This makes handling your dataframe much easier in pandas and is encouraged. Second, I see that with the index_col=0 argument you use the country names as the index of the dataframe. In this case you should use the index rather than the loc function, as follows:
df.index[df['Gold'] == df['Gold'].max()][0]
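With the toy frame from above re-indexed by country, a quick sketch of what that looks like:
df2 = df.set_index('Country')
# the boolean mask selects the row(s) with the maximum Gold value;
# [0] pulls out the first matching index label
df2.index[df2['Gold'] == df2['Gold'].max()][0]  # 'Argentina'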
import pandas as pd

def answer_one():
    df1 = pd.Series.max(df['Gold'])   # the maximum value in the Gold column
    df1 = df[df['Gold'] == df1]       # the row(s) where Gold equals that maximum
    return df1.index[0]

answer_one()
Series.idxmax() returns the index label of the maximum element, which here is the country name (in older pandas versions argmax() behaved the same way, but it now returns the integer position instead):
return df['Gold'].idxmax()
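Putting the pieces together, a minimal sketch of the whole function (assuming olympics.csv is loaded with country names as the index and a Gold column, as in the read_csv call above):
import pandas as pd
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
def answer_one():
    # idxmax returns the index label (the country name) of the largest Gold value
    return df['Gold'].idxmax()
print(answer_one())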

Can I forward recalculate a value in a pandas dataframe when a value has been reset, e.g. a water meter

I want to forward fill my water meter reading data when resets occur so that the data is clean for analysis. A reset is when the value in the next row is less than the previous one.
My python pandas dataframe looks like this:-
water
0 31031
1 31037
2 31038
3 31043
4 131 (system was reset)
5 223
6 331
7 412
...
It is possible that there are several resets of the water data in my pandas dataframe.
Research suggests that using for loops/iteration is not the best option with pandas dataframes, so I am trying to avoid them.
I would like to update the dataframe df so that the fact that the system was reset at index 4 is no longer visible and the water figures continue to cumulate.
e.g.
water
0 31031
1 31037
2 31038
3 31043
4 31174 # system reset to 0 so value should be 31043 + 131
5 31266 # continuing with the difference through to the end of df
6 31374
7 31455
...
import pandas as pd
import numpy as np

df = pd.DataFrame({'water': [31031, 31037, 31038, 31043, 131, 223, 331, 412]})
df["waterreset"] = np.where(df["water"] - df["water"].shift(1) < 0,
                            df["water"] + df["water"].shift(1), df["water"])
print(df)
"waterreset" code line above only identifies the one line where the reset occurs and doesn't fill forward, plus I would rather use inplace=True to update the current dataframe.
You can find where the resets are, take the previous values, and add them to the subsequent readings:
# resets
resets = df.water.diff().le(0)
# reading at resets
# cumsum is used to accumulate readings
readings = df.water.shift().where(resets).fillna(0).cumsum()
df.water += readings
Output:
water
0 31031.0
1 31037.0
2 31038.0
3 31043.0
4 31174.0
5 31266.0
6 31374.0
7 31455.0
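Note that df.water += readings already updates df in place, so no inplace=True flag is needed. For completeness, a self-contained sketch of the whole correction (using the water column from the question):
import pandas as pd
df = pd.DataFrame({'water': [31031, 31037, 31038, 31043, 131, 223, 331, 412]})
resets = df.water.diff().le(0)                                 # True where a reset happened
readings = df.water.shift().where(resets).fillna(0).cumsum()   # running offset to add back
df.water += readings                                           # modifies df in place
print(df)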

How to deal with a copy-pasted table in pandas: reshaping a column vector

I have a table I copied from a webpage which, when pasted into LibreOffice Calc or Excel, occupies a single cell, and when pasted into Notepad becomes a 3507x1 column. If I import this as a pandas dataframe using pd.read_csv I see the same 3507x1 column, and I'd now like to reshape it into the 501x7 array that it started as.
I thought I could recast it as a numpy array, reshape it as I am familiar with in numpy, and then put it back into a df, but the to_numpy methods of pandas seem to want to work with a Series object (not a DataFrame), and attempts to read the file into a Series using e.g.
ser = pd.Series.from_csv('billionaires')
led to tokenizing errors. Is there some simple way to do this? Maybe I should throw in the towel on this direction and read from the html?
A simple copy-paste does not give you any clear column separator, so it's impossible to do this easily.
You only have spaces, but spaces may or may not appear inside the column values too (like in the name or the country), so it is impossible to give pd.read_csv a column separator.
However, if I copy-paste the table into a file, I notice some regularity.
If you know regex, you can try using pandas.Series.str.extract. This method extracts capture groups in a regex pattern as columns of a DataFrame. The regex is applied to each element / string of the series.
You can then try to find a regex pattern to capture the various elements of the row to split them into separate columns.
df = pd.read_csv('data.txt', names=["A"])  # no header in the file
ss = df['A']
rdf = ss.str.extract(r'(\d)\s+(.+)(\$[\d\.]+B)\s+([+-]\$[\d\.]+[BM])\s+([+-]\$[\d\.]+B)\s+([\w\s]+)\s+([\w\s]+)')
Here I tried to write a regex for the table in the link; the result on the first rows seems pretty good.
0 1 2 3 4 5 6
0 1 Jeff Bezos $121B +$231M -$3.94B United States Technology
1 3 Bernard Arnault $104B +$127M +$35.7B France Consumer
2 4 Warren Buffett $84.9B +$66.3M +$1.11B United States Diversified
3 5 Mark Zuckerberg $76.7B -$301M +$24.6B United States Technology
4 6 Amancio Ortega $66.5B +$303M +$7.85B Spain Retail
5 7 Larry Ellison $62.3B +$358M +$13.0B United States Technology
6 8 Carlos Slim $57.0B -$331M +$2.20B Mexico Diversified
7 9 Francoise Bettencourt Meyers $56.7B -$1.12B +$10.5B France Consumer
8 0 Larry Page $55.7B +$393M +$4.47B United States Technology
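One caveat with the sketch above: (\d) captures exactly one digit, which is why the last row shows rank 0 instead of 10. Assuming ranks can have several digits, (\d+) is the safer pattern:
# (\d+) allows multi-digit ranks such as 10
rdf = ss.str.extract(r'(\d+)\s+(.+)(\$[\d\.]+B)\s+([+-]\$[\d\.]+[BM])\s+([+-]\$[\d\.]+B)\s+([\w\s]+)\s+([\w\s]+)')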
I used pd.read_csv to read the file, since Series.from_csv is deprecated.
I found that converting to a numpy array was far easier than I had realized: the numpy asarray method can handle a df (and conveniently enough, it works for general objects, not just numbers).
import pandas as pd
import numpy as np

df = pd.read_csv('billionaires', sep='\n')
print(df.shape)
# -> (3507, 1)
n = np.asarray(df)
m = np.reshape(n, [-1, 7])
df2 = pd.DataFrame(m)
df2.head()
0 1 2 3 4 \
0 0 Name Total net worth $ Last change $ YTD change
1 1 Jeff Bezos $121B +$231M -$3.94B
2 2 Bill Gates $107B -$421M +$16.7B
3 3 Bernard Arnault $104B +$127M +$35.7B
4 4 Warren Buffett $84.9B +$66.3M +$1.11B
5 6
0 Country Industry
1 United States Technology
2 United States Technology
3 France Consumer
4 United States Diversified
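Since the first reshaped row is clearly the header, a possible follow-up sketch (assuming the reshape above) is to promote it to column names:
# use the first row as the header, then drop it from the data
df2.columns = df2.iloc[0]
df2 = df2.iloc[1:].reset_index(drop=True)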

Remove comma from objects in a pandas dataframe column [duplicate]

This question already has answers here:
Convert number strings with commas in pandas DataFrame to float
(4 answers)
Closed 9 months ago.
I have imported a csv file using pandas.
My dataframe has multiple columns titled "Farm", "Total Apples" and "Good Apples".
The numerical data imported for "Total Apples" and "Good Apples" contains commas to indicate thousands e.g. 1,200 etc.
I want to remove the comma so the data looks like 1200 etc.
The variable type for the "Total Apples" and "Good Apples" columns comes up as object.
I tried using df.str.replace and df.strip but have not been successful.
I also tried to change the variable type from object to string and from object to integer, but couldn't make it work.
Any help would be greatly appreciated.
****EDIT****
Excerpt of data from csv file imported using pd.read_csv:
Farm_Name Total Apples Good Apples
EM 18,327 14,176
EE 18,785 14,146
IW 635 486
L 33,929 24,586
NE 12,497 9,609
NW 30,756 23,765
SC 8,515 6,438
SE 22,896 17,914
SW 11,972 9,114
WM 27,251 20,931
Y 21,495 16,662
I think you can add the parameter thousands to read_csv; the values in the columns Total Apples and Good Apples are then converted to integers:
Maybe your separator is different; don't forget to change it. If the separator is whitespace, use sep='\s+'.
import pandas as pd
import io
temp=u"""Farm_Name;Total Apples;Good Apples
EM;18,327;14,176
EE;18,785;14,146
IW;635;486
L;33,929;24,586
NE;12,497;9,609
NW;30,756;23,765
SC;8,515;6,438
SE;22,896;17,914
SW;11,972;9,114
WM;27,251;20,931
Y;21,495;16,662"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), sep=";", thousands=',')
print(df)
Farm_Name Total Apples Good Apples
0 EM 18327 14176
1 EE 18785 14146
2 IW 635 486
3 L 33929 24586
4 NE 12497 9609
5 NW 30756 23765
6 SC 8515 6438
7 SE 22896 17914
8 SW 11972 9114
9 WM 27251 20931
10 Y 21495 16662
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
Farm_Name 11 non-null object
Total Apples 11 non-null int64
Good Apples 11 non-null int64
dtypes: int64(2), object(1)
memory usage: 336.0+ bytes
None
try this:
import locale

locale.setlocale(locale.LC_NUMERIC, '')
df = df[['Farm_Name']].join(df[['Total Apples', 'Good Apples']].applymap(locale.atof))
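Since the question mentions trying df.str.replace: .str is only available on a Series (a single column), not on the whole DataFrame, which is likely why that attempt failed. A per-column sketch, assuming the column names shown above:
for col in ['Total Apples', 'Good Apples']:
    # strip the thousands separators, then convert to integers
    df[col] = df[col].str.replace(',', '').astype(int)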

Editing data in CSV files using Pandas

I have a CSV file with the following data:
Time Pressure
0 2.9852.988
10 2.9882.988
20 2.9902.990
30 2.9882.988
40 2.9852.985
50 2.9842.984
60 2.9852.985.....
For some reason the second column is separated by two decimal points. I'm trying to create a DataFrame with pandas but cannot proceed without removing the second decimal point. I cannot do this manually as there are thousands of data points in my file. Any ideas?
You can call the vectorised str methods to split the string on the decimal point, then join the result of the split while discarding the last element. This produces, for example, the list ['2', '9852'], which joined with a decimal point gives '2.9852':
In [28]:
df['Pressure'].str.split('.').str[:-1].str.join('.')
Out[28]:
0 2.9852
1 2.9882
2 2.9902
3 2.9882
4 2.9852
5 2.9842
6 2.9852
Name: Pressure, dtype: object
If you want to convert the string to a float then call astype:
In [29]:
df['Pressure'].str.split('.').str[:-1].str.join('.').astype(np.float64)
Out[29]:
0 2.9852
1 2.9882
2 2.9902
3 2.9882
4 2.9852
5 2.9842
6 2.9852
Name: Pressure, dtype: float64
Just remember to assign the conversion back to the original df:
df['Pressure'] = df['Pressure'].str.split('.').str[:-1].str.join('.').astype(np.float64)
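An alternative sketch that keeps only the leading well-formed number via a regex (using str.extract, as elsewhere on this page):
# capture the first float and drop whatever follows the second dot
df['Pressure'] = df['Pressure'].str.extract(r'^(\d+\.\d+)', expand=False).astype(float)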
