Here is the code and the output. I assume it is about the score not being an int, but I am not sure how to convert it in this case.
df.index = df.columns
rows = []
for i in df.index:
    for c in df.columns:
        if i == c:
            continue
        score = df.ix[i, c]
        score = [int(row) for row in score.split('-')]
        rows.append([i, c, score[0], score[1]])
df = pd.DataFrame(rows, columns=['home', 'away', 'home_score', 'away_score'])
df.head()
You're splitting on "-" (U+002D HYPHEN-MINUS), but your data is using some other character... it's hard to say, since you provided a picture of the error instead of the actual error text, but it's probably "–" (U+2013 EN DASH). Fix your split to use the character that actually occurs in the input.
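For example (a minimal sketch, assuming the cells look like '0–3' with an en dash):
import re

score = '0–3'                                            # en dash (U+2013)
home, away = (int(x) for x in re.split(r'[-–]', score))  # split on either kind of dash
# or side-step the separator question entirely and just grab the digits:
home, away = (int(x) for x in re.findall(r'\d+', score))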
I think you should just do
score = [int(row) for row in score if row.isnumeric()]
Take advantage of the .isnumeric() method of strings. (Note that iterating over the string goes character by character, so this only works while both scores are single digits; '10–3' would become [1, 0, 3].)
P.S. You should not be using the .ix indexer with pandas. It was deprecated in 0.20 and has since been removed; use .loc (label-based) or .iloc (position-based) instead, e.g. score = df.loc[i, c].
The screenshot you posted identifies the issue. You are trying to call int() on the str '0-3', which can't be done. From my terminal:
In [1]: int('0-3')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-c6bc87cd2bc7> in <module>
----> 1 int('0-3')
ValueError: invalid literal for int() with base 10: '0-3'
Looking at your dataframe, it looks like you have a lot of data that can't be turned into an int as is.
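To see exactly which cells won't convert, a hedged sketch (str.fullmatch needs pandas >= 1.1; the pattern accepts either a hyphen or an en dash):
vals = df.stack().astype(str)                          # flatten all cells into one Series
bad = vals[~vals.str.fullmatch(r'\d+\s*[-–]\s*\d+')]   # anything that is not 'N-N'
print(bad)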
I am trying to cut off (convert from object to int dtype) the m² suffix from the Erf Size column in the dataset, using the following code: train['Erf Size'] = train['Erf Size'].str[:-2].astype(int)
However, I am getting this error instead: ValueError: invalid literal for int() with base 10: '1 733'
Please kindly help.
It may not be the most efficient way to strip a suffix from a DataFrame column, but the code below works.
The method used is pandas.Series.str.split:
import pandas as pd
train = pd.read_csv('bernard.csv', encoding='ansi')
train[['Size', 'm²']] = train['Erf Size'].str.split(' ', expand=True)
train['Erf Size'] = train['Size']
train.drop(['Size', 'm²'], axis=1, inplace=True)
>>> train
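Note that splitting on a single space assumes exactly one space per value, while the error message shows '1 733', i.e. a space used as a thousands separator, which would produce three columns. A hedged alternative sketch that strips every non-digit character instead (file and column names taken from the question):
import pandas as pd

train = pd.read_csv('bernard.csv', encoding='ansi')
# '1 733 m²' -> '1733' -> 1733: drop everything that is not a digit, then convert
train['Erf Size'] = train['Erf Size'].str.replace(r'\D', '', regex=True).astype(int)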
I'm trying to add empty columns to my dataset on Colab, but it gives me this error, while on my local machine it works perfectly fine. Does anybody know a possible solution for this?
My code:
dataframe["Comp"] = ''
dataframe["Negative"] = ''
dataframe["Neutral"] = ''
dataframe["Positive"] = ''
dataframe
Error message:
TypeError: Expected unicode, got pandas._libs.properties.CachedProperty
I ran into a similar issue today:
"Expected unicode, got pandas._libs.properties.CachedProperty"
My dataframe (called df) has a time index. When I added a new column and filled it with numpy.array data, it raised this error. I tried setting it with df.index or df.index.values; it always raised this error.
Finally, I solved it in 3 steps:
df = df.reset_index()                     # move the time index into a regular column
df['new_column'] = new_column_data        # new_column_data is in np.array format
df = df.set_index('original_index_name')  # restore the original index
This question is the same as https://stackoverflow.com/a/67997139/16240186, and there's a simple way to solve it: df = df.asfreq('H') # freq can be min, D, M, S, 5min, etc.
I've written the function (tested and working) below:
import pandas as pd

def ConvertStrDateToWeekId(strDate):
    dateformat = '2016-7-15 22:44:09'  # example of the expected input (unused)
    aDate = pd.to_datetime(strDate)
    wk = aDate.isocalendar()[1]
    yr = aDate.isocalendar()[0]
    Format_4_5_4_date = str(yr) + str(wk)
    return Format_4_5_4_date
and from what I have seen on line I should be able to use it this way:
ml_poLines = result.value.select('PURCHASEORDERNUMBER', 'ITEMNUMBER', 'PRODUCTCOLORID', 'RECEIVINGWAREHOUSEID', ConvertStrDateToWeekId('CONFIRMEDDELIVERYDATE'))
However, when I "show" my dataframe, the "CONFIRMEDDELIVERYDATE" column is still the original datetime string! No errors are given.
I've also tried this:
ml_poLines['WeekId'] = (ConvertStrDateToWeekId(ml_poLines['CONFIRMEDDELIVERYDATE']))
and get the following error:
"ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions." which makes no sense to me.
I've also tried this with no success.
x = ml_poLines.toPandas();
x['testDates'] = ConvertStrDateToWeekId(x['CONFIRMEDDELIVERYDATE'])
ml_poLines2 = spark.createDataFrame(x)
ml_poLines2.show()
The above generates the following error:
AttributeError: 'Series' object has no attribute 'isocalendar'
What have I done wrong?
Your function ConvertStrDateToWeekId takes a string. But in the following line the argument of the function call is a series of strings:
x['testDates'] = ConvertStrDateToWeekId(x['CONFIRMEDDELIVERYDATE'])
A possible workaround for this error is to use the apply-function of pandas:
x['testDates'] = x['CONFIRMEDDELIVERYDATE'].apply(ConvertStrDateToWeekId)
But without more information about the kind of data you are processing it is hard to provide further help.
This was the work-around that I got to work:
`# convert the confirimedDeliveryDate to a WeekId
x= ml_poLines.toPandas();
x['WeekId'] = x[['ITEMNUMBER', 'CONFIRMEDDELIVERYDATE']].apply(lambda y:ConvertStrDateToWeekId(y[1]), axis=1)
ml_poLines = spark.createDataFrame(x)
ml_poLines.show()`
Not quite as clean as I would like. Maybe someone else can propose a cleaner solution.
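For what it's worth, a possibly cleaner vectorized sketch on the pandas side, assuming the column parses with pd.to_datetime and pandas >= 1.1 (where Series.dt.isocalendar() returns year/week/day columns):
import pandas as pd

x = ml_poLines.toPandas()
iso = pd.to_datetime(x['CONFIRMEDDELIVERYDATE']).dt.isocalendar()
x['WeekId'] = iso['year'].astype(str) + iso['week'].astype(str)  # same year+week format as the function
ml_poLines = spark.createDataFrame(x)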
The HTTP log files I'm trying to analyze with pandas sometimes contain unexpected lines. Here's how I load my data:
df = pd.read_csv('mylog.log',
                 sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                 engine='python', na_values=['-'], header=None,
                 usecols=[0, 3, 4, 5, 6, 7, 8, 10],
                 names=['ip', 'time', 'request', 'status', 'size',
                        'referer', 'user_agent', 'req_time'],
                 converters={'status': int, 'size': int, 'req_time': int})
It works fine for most of the logs I have (which come from the same server). However, upon loading some logs, an exception is raised:
either
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
or
ValueError: invalid literal for int() with base 10: '"GET /agent/10577/bdl HTTP/1.1"'
For the sake of the example, here's the line that triggers the second exception:
22.111.117.229, 22.111.117.229 - - [19/Sep/2018:22:17:40 +0200] "GET /agent/10577/bdl HTTP/1.1" 204 - "-" "okhttp/3.8.0" apibackend.site.fr 429282
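One way to see why this line trips the converters is to tokenize it with the same separator; the doubled client IP appears to add an extra field, shifting every later column by one, so the request string lands in the 'status' position:
import re

sep = r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])'
line = ('22.111.117.229, 22.111.117.229 - - [19/Sep/2018:22:17:40 +0200] '
        '"GET /agent/10577/bdl HTTP/1.1" 204 - "-" "okhttp/3.8.0" '
        'apibackend.site.fr 429282')
print(re.split(sep, line))  # '"GET /agent/10577/bdl HTTP/1.1"' ends up at index 5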
To find the number of the offending line, I used the following (terribly slow) function:
def search_error_dichotomy(path):
    borne_inf = 0
    log = open(path)
    borne_sup = len(log.readlines())
    log.close()
    while borne_sup - borne_inf > 1:
        exceded = False
        search_index = (borne_inf + borne_sup) // 2
        try:
            pd.read_csv(path, ..., ..., nrows=search_index)
        except:
            exceded = True
        if exceded:
            borne_sup = search_index
        else:
            borne_inf = search_index
    return search_index
What I'd like to have is something like this :
try:
    pd.read_csv(..........................)
except MyError as e:
    print(e.row_number)
where e.row_number is the number of the messy line.
Thank you in advance.
SOLUTION
All credits to devssh, whose suggestion not only makes the process quicker but also lets me get all the unexpected lines at once. Here's what I did with it:
Load the dataframe without converters.
df = pd.read_csv(path,
                 sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                 engine='python', na_values=['-'], header=None,
                 usecols=[0, 3, 4, 5, 6, 7, 8, 10],
                 names=['ip', 'time', 'request', 'status', 'size',
                        'referer', 'user_agent', 'req_time'])
Add an 'index' column using .reset_index().
df = df.reset_index()
Write a custom function (to be used with apply) that converts to int if possible, and otherwise saves the entry and its 'index' in the dictionary wrong_lines:
import numpy as np

wrong_lines = {}

def convert_int_feedback_index(row, col):
    try:
        ans = int(row[col])
    except (ValueError, TypeError):
        wrong_lines[row['index']] = row[col]
        ans = np.nan  # pd.np is deprecated; use numpy directly
    return ans
Use apply on the columns I want to convert (e.g. col = 'status', 'size', or 'req_time'):
df[col] = df.apply(convert_int_feedback_index, axis=1, col=col)
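A possibly tidier variant of steps 3-4 (a hedged sketch, untested here): pd.to_numeric with errors='coerce' does the conversion in one vectorized call, and comparing against the original column flags the rows that failed:
import pandas as pd

for col in ['status', 'size', 'req_time']:
    converted = pd.to_numeric(df[col], errors='coerce')
    bad = df.loc[converted.isna() & df[col].notna(), ['index', col]]  # present but unconvertible
    print(bad)
    df[col] = converted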
Did you try pd.read_csv(..., nrows=10) to see if it works on even 10 lines?
Perhaps you should not use converters to specify the dtypes.
Load the DataFrame, then apply the dtype to columns like df["column"] = df["column"].astype(np.int64), or a custom function like df["column"] = df["column"].apply(lambda x: convert_type(x)), and handle the errors yourself in convert_type.
Finally, write the csv back out by calling df.to_csv("preprocessed.csv", header=True, index=False).
I don't think you can get the line number from pd.read_csv itself. That separator looks quite complex.
Or you can try reading the csv as a single-column DataFrame and using df["column"].str.extract with a regex to pull out the columns. That way you control how the exception is raised or what default value handles the error.
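A hedged sketch of that single-column idea (the regex is only illustrative, matching just the status field):
import pandas as pd

with open('mylog.log') as f:
    raw = pd.DataFrame({'line': f.read().splitlines()})

# rows whose status field doesn't match become NaN instead of raising,
# and raw.index gives their line numbers directly
raw['status'] = pd.to_numeric(raw['line'].str.extract(r'" (\d{3}) ', expand=False),
                              errors='coerce')
print(raw[raw['status'].isna()])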
df.reset_index() will give you the row numbers as a column, so when you apply over multiple columns you get the row number alongside the values. Combine that with apply and you can customize everything.
I'm trying to count the individual words in a column of my data frame. It looks like this (in reality the texts are Tweets):
text
this is some text that I want to count
That's all I wan't
It is unicode text
So what I found from other stackoverflow questions is that I could use the following:
Count most frequent 100 words from sentences in Dataframe Pandas
Count distinct words from a Pandas Data Frame
My df is called result and this is my code:
from collections import Counter
result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
result2
I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-6-2f018a9f912d> in <module>()
1 from collections import Counter
----> 2 result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
3 result2
TypeError: sequence item 25831: expected str instance, float found
The dtype of text is object, which from what I understand is correct for unicode text data.
The issue occurs because some of the values in your series (result['text']) are of type float (most likely NaN for missing text). If you want to include them in ' '.join() as well, you need to convert the floats to strings before passing them to str.join().
You can use Series.astype() to convert all the values to strings. Also, you really do not need .tolist(); you can give the series to str.join() directly. Example -
result2 = Counter(" ".join(result['text'].astype(str)).split(" ")).items()
Demo -
In [60]: df = pd.DataFrame([['blah'],['asd'],[10.1]],columns=['A'])
In [61]: df
Out[61]:
A
0 blah
1 asd
2 10.1
In [62]: ' '.join(df['A'])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-62-77e78c2ee142> in <module>()
----> 1 ' '.join(df['A'])
TypeError: sequence item 2: expected str instance, float found
In [63]: ' '.join(df['A'].astype(str))
Out[63]: 'blah asd 10.1'
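Alternatively, if the floats are just NaN placeholders for missing tweets (an assumption about your data), dropping them first avoids counting the literal string 'nan':
from collections import Counter

result2 = Counter(" ".join(result['text'].dropna()).split(" ")).items()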
In the end I went with the following code:
pd.set_option('display.max_rows', 100)
# cast to str (so floats/NaN don't break the join), lower-case, split on spaces,
# and keep the 100 most frequent words
words = pd.Series(' '.join(result['text'].astype(str)).lower().split(" ")).value_counts()[:100]
words
The problem was however solved by Anand S Kumar.