I am using pandas to read in a csv column where every row has the following format:
IP: XXX:XX:XX:XXX
To get rid of the IP: prefix, I am editing the column after the fact:
logs['ip'] = logs['ip'].str[4:]
Is there a way to perform this operation within read_csv, maybe with regex, so I can avoid the post-processing step?
Update:
Consider this scenario where there are multiple columns that have these prefixes – is there a better way?
logs['mac'] = logs['mac'].str[5:]
logs['id'] = logs['id'].str[4:]
logs['lan'] = logs['lan'].str[5:]
logs['ip'] = logs['ip'].str[4:]
The converters option for read_csv might provide a useful way. Let's say the file looks like this:
id address
1 IP:123.1.1.1
2 IP:456.1.1.1
3 IP:789.1.1.1
Then you could specify that 'IP:' should be converted to '' (blank) like this:
dct = {'address': lambda x: x.replace('IP:', '')}
df = pd.read_csv('foo.txt', sep=r'\s+', converters=dct)
id address
0 1 123.1.1.1
1 2 456.1.1.1
2 3 789.1.1.1
I'm ignoring the slight complication that if there is a space after IP: then you might be reading IP: in as its own column, but you ought to be able to adapt this fairly easily to handle that.
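For the multi-column case in the update, a single converters dict built with a comprehension keeps things concise. A minimal sketch, assuming the prefix lengths from the question's slice expressions and a hypothetical logs.csv:

import pandas as pd

# prefix lengths taken from the question's slice expressions
cut = {'mac': 5, 'id': 4, 'lan': 5, 'ip': 4}
dct = {col: (lambda x, n=n: x[n:]) for col, n in cut.items()}  # n=n binds the length per column
df = pd.read_csv('logs.csv', converters=dct)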
You could also convert each value to a string and then use .split("IP: ")[1], which returns everything after "IP: ". I'm not sure if this is the best approach, but it's what came to mind. If the whitespace after the colon varies, a regex split handles it:
re.split(r"IP:\s*", s)[1]
Related
I have a CSV with the following entries:
"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"
The issue comes when I try to read 8 inches"; I am unable to read the CSV using read_csv().
pd.read_csv(io.BytesIO(obj['Body'].read()), sep="|",
            quoting=1,
            engine='c', error_bad_lines=False, warn_bad_lines=True,
            encoding="utf-8", converters=pandas_config['converters'],
            skipinitialspace=True, escapechar='\"')
Is there a way to handle the quote within the cell.
Start by passing appropriate parameters for this case:
sep='[|,]' - there are two separators, a pipe char and a comma, so define them as a regex.
skipinitialspace=True - your source text contains extra spaces (after separators), so drop them.
engine='python' - a regex separator is only supported by the Python engine; passing it explicitly suppresses the warning about falling back to the 'python' engine.
The above options alone allow read_csv to run without error, but the downside (for now) is that the double quotes remain.
To eliminate them, at least from the data rows, another trick is needed:
Define a converter (lambda) function:
cnv = lambda txt: txt.replace('"', '')
and apply it to all source columns.
In your case you have 5 columns, so to keep the code concise,
you can use a dictionary comprehension:
{ i: cnv for i in range(5) }
So the whole code can be:
df = pd.read_csv(io.StringIO(txt), sep='[|,]', skipinitialspace=True,
engine='python', converters={ i: cnv for i in range(5) })
and the result is:
"column1" "column2" "column3" "column4" "column5"
0 123 sometext this somedata 8 inches hello
But remember that all columns are now of string type, so you should convert the required columns to numbers. An alternative is to pass a second converter for numeric columns, returning a number instead of a string.
To have proper column names (without double quotes), you can pass additional parameters (see the full sketch after this list):
skiprows=1 - to omit the initial line,
names=["column1", "column2", "column3", "column4", "column5"] - to define the column list on your own.
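Putting all of this together, a sketch under the assumptions above (the int converter for column1 is an illustrative addition, not part of the original answer):

import io
import pandas as pd

txt = '''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''

cnv = lambda s: s.replace('"', '')        # strip stray double quotes
num = lambda s: int(s.replace('"', ''))   # numeric variant for column1

cols = ["column1", "column2", "column3", "column4", "column5"]
df = pd.read_csv(io.StringIO(txt), sep='[|,]', skipinitialspace=True,
                 engine='python', skiprows=1, names=cols,
                 converters={**{c: cnv for c in cols}, 'column1': num})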
We can specify a somewhat complicated separator, read the data, and strip the extra quote chars:
# Test data:
text='''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''
ff = io.StringIO(text)
df = pd.read_csv(ff, sep=r'"\s*[|,]\s*"', engine="python")
# Make it tidy:
df = df.transform(lambda s: s.str.strip('"'))
df.columns = ["column1"] + list(df.columns[1:-1]) + ["column5"]
The result:
column1 column2 column3 column4 column5
0 123 sometext this somedata 8 inches hello
I open raw data using pandas
df = pd.read_csv(file)
Here's what part of my dataframe looks like:
37280 7092|156|Laboratory Data|A648C751-A4DD-4CZ2-85
47981 7092|156|Laboratory Data|Z22CD01C-8Z4B-4ZCB-8B
57982 7092|156|Laboratory Data|C12CE01C-8F4B-4CZB-8B
I'd like to replace every pipe ('|') with a tab ('\t').
So I tried :
df.replace('|','\t')
But it doesn't work. How can I do this?
Many thanks!
By default, the DataFrame replace method matches values that exactly equal the string provided. You need to specify regex=True to replace patterns, and since | is a special character in regex, it needs to be escaped:
df1 = df.replace(r"\|", "\t", regex=True)
df1
# 0 1
#0 37280 7092\t156\tLaboratory Data\tA648C751-A4DD-4CZ2-85
#1 47981 7092\t156\tLaboratory Data\tZ22CD01C-8Z4B-4ZCB-8B
#2 57982 7092\t156\tLaboratory Data\tC12CE01C-8F4B-4CZB-8B
If we print the cell, the tabs are rendered as expected:
print(df1[1].iat[0])
# 7092 156 Laboratory Data A648C751-A4DD-4CZ2-85
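If only one column contains the pipes, Series.str.replace with regex=False is an alternative that sidesteps the escaping (a sketch; the column label 1 is taken from the frame above):

df[1] = df[1].str.replace('|', '\t', regex=False)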
You also need to assign the result back to the variable, and, as above, use regex=True with the pipe escaped:
df = df.replace(r'\|', '\t', regex=True)
I am using Python to analyze a large set of CSV data. This data contains 4 different types of metrics for a given timestamp and host pair, with the metric type indicated in the first field of each row. Here's a simplified example:
metric,timestamp,hostname,value
metric1,1488063747,example01.net,12
metric2,1488063747,example01.net,23
metric3,1488063747,example01.net,34
metric4,1488063747,example01.net,45
metric1,1488063788,example02.net,56
metric2,1488063788,example02.net,67
metric3,1488063788,example02.net,78
metric4,1488063788,example02.net,89
So, for every row (actually, a list within a list of lists) I make an index composed of the timestamp and hostname:
idx = row[1] + ',' + row[2]
Now, based on the contents of the first field (list element), I do something like:
if row[0] == 'metric1': metric_dict[idx] = row[3]
I do that for each of the 4 metrics. It works, but it seems like there should be a better way: I need to somehow choose the dictionary to use based on the contents of row[0], but my searches have not turned up a solution. Four if statements are manageable, but it wouldn't be unusual for a file to contain more metric types. Is it possible to do this and be left with however many dictionaries are needed after the list of lists is read? Thank you.
Problem: not enough dicts.
Solution: map each metric name to its dictionary, then dispatch on row[0]:
metric1_dict, metric2_dict = {}, {}
conversion_dict = {'metric1': metric1_dict, 'metric2': metric2_dict}
for row in rows:
    idx = row[1] + ',' + row[2]
    conversion_dict[row[0]][idx] = row[3]
Why not something like:
output = {}
for row in rows:
    # assuming the data is already split into fields
    if row[0] not in output:
        output[row[0]] = {}
    idx = row[1] + ',' + row[2]
    output[row[0]][idx] = row[3]
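A collections.defaultdict expresses the same idea more concisely (a sketch; rows is assumed to be the already-parsed list of lists):

from collections import defaultdict

output = defaultdict(dict)
for row in rows:
    output[row[0]][row[1] + ',' + row[2]] = row[3]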
If you're doing a lot of table manipulation, you may find the pandas library helpful. If I understand correctly what you're trying to do:
import pandas as pd
from io import StringIO
s = StringIO("""metric,timestamp,hostname,value
metric1,1488063747,example01.net,12
metric2,1488063747,example01.net,23
metric3,1488063747,example01.net,34
metric4,1488063747,example01.net,45
metric1,1488063788,example02.net,56
metric2,1488063788,example02.net,67
metric3,1488063788,example02.net,78
metric4,1488063788,example02.net,89
""")
df = pd.read_csv(s)
df.pivot(index="timestamp", columns='metric',values='value')
This yields:
metric metric1 metric2 metric3 metric4
timestamp
1488063747 12 23 34 45
1488063788 56 67 78 89
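If the same timestamp could ever appear for two different hosts, including hostname in the index keeps those pairs distinct; pivot_table (a sketch) also tolerates duplicate entries:

df.pivot_table(index=["timestamp", "hostname"], columns="metric", values="value")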
I am receiving an object array after applying re.findall for links and hashtags on Tweets data. My data looks like:
b=['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q']
['https://t.co/CJZWjaBfJU']
['https://t.co/4GMhoXhBQO', 'https://t.co/0V']
['https://t.co/Erutsftlnq']
['https://t.co/86VvLJEzvG', 'https://t.co/zCYv5WcFDS']
Now I want to split it into columns; I am using the following:
df = pd.DataFrame(b.str.split(',',1).tolist(),columns = ['flips','row'])
But it is not working, I guess because of the datatype. I tried a few other solutions as well; nothing worked. This is what I am expecting, two separate columns:
https://t.co/1u0dkzq2dV https://t.co/3XIZ0SN05Q
https://t.co/CJZWjaBfJU
https://t.co/4GMhoXhBQO https://t.co/0V
https://t.co/Erutsftlnq
https://t.co/86VvLJEzvG
It's not clear from your question what exactly is part of your data (does it include the square brackets and single quotes?). In any case, the pandas read_csv function is very versatile and can handle ragged data:
from io import StringIO
import pandas as pd
raw_data = """
['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q']
['https://t.co/CJZWjaBfJU']
['https://t.co/4GMhoXhBQO', 'https://t.co/0V']
['https://t.co/Erutsftlnq']
['https://t.co/86VvLJEzvG', 'https://t.co/zCYv5WcFDS']
"""
# You'll probably replace the StringIO part with the filename of your data.
df = pd.read_csv(StringIO(raw_data), header=None, names=('flips', 'row'))
# Get rid of the square brackets, single quotes, and leading spaces
for col in ('flips', 'row'):
    df[col] = df[col].str.strip(" []'")
df
Output:
flips row
0 https://t.co/1u0dkzq2dV https://t.co/3XIZ0SN05Q
1 https://t.co/CJZWjaBfJU NaN
2 https://t.co/4GMhoXhBQO https://t.co/0V
3 https://t.co/Erutsftlnq NaN
4 https://t.co/86VvLJEzvG https://t.co/zCYv5WcFDS
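If the findall results are still an in-memory list of lists, you may not need read_csv at all: the DataFrame constructor pads ragged rows with missing values (a sketch, assuming b is that list of lists):

import pandas as pd

b = [['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q'],
     ['https://t.co/CJZWjaBfJU']]
df = pd.DataFrame(b, columns=['flips', 'row'])  # short rows become None/NaN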
I have a Pandas DataFrame that contains several string values.
I want to replace them with integer values in order to calculate similarities.
For example:
stores[['CNPJ_Store_Code','region','total_facings']].head()
Out[24]:
CNPJ_Store_Code region total_facings
1 93209765046613 Geo RS/SC 1.471690
16 93209765046290 Geo RS/SC 1.385636
19 93209765044084 Geo PR/SPI 0.217054
21 93209765044831 Geo RS/SC 0.804633
23 93209765045218 Geo PR/SPI 0.708165
and I want to replace region == 'Geo RS/SC' ==> 1, region == 'Geo PR/SPI'==> 2 etc.
Clarification: I want to do the replacement automatically, without creating a dictionary first, since I don't know in advance what my regions will be.
Any ideas? I am trying to use DictVectorizer, with no success.
I'm sure there's an intelligent way to do it, but I just can't find it.
Anyone familiar with a solution?
You can use the .apply() function and a dictionary to map all known string values to their corresponding integer values:
region_dictionary = {'Geo RS/SC': 1, 'Geo PR/SPI' : 2, .... }
stores['region'] = stores['region'].apply(lambda x: region_dictionary[x])
It looks to me like you really want pandas categoricals:
http://pandas-docs.github.io/pandas-docs-travis/categorical.html
I think you just need to change the dtype of your text column to "category" and you are done:
stores['region'] = stores["region"].astype('category')
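To get the integers themselves, assigned automatically from whatever regions are observed, the category codes should do it; a small sketch (codes start at 0, so add 1 for 1-based numbering):

stores['region'] = stores['region'].astype('category')
stores['region_num'] = stores['region'].cat.codes + 1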
You can do:
df = pd.read_csv(filename, index_col=0)  # Assuming it's a csv file.

def region_to_numeric(a):
    if a == 'Geo RS/SC':
        return 1
    if a == 'Geo PR/SPI':
        return 2

df['region_num'] = df['region'].apply(region_to_numeric)
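If you'd rather not enumerate the regions by hand (per the clarification in the question), pd.factorize assigns the codes automatically; a sketch:

import pandas as pd

codes, uniques = pd.factorize(stores['region'])
stores['region_num'] = codes + 1  # factorize codes start at 0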