Update pandas dataframe column with substring from another dataframe - python

I have a dataframe where some of the information is in the wrong field and I need to move it to the right column. The problem is that this information is user-entered, so it appears in the middle of strings and in different formats. The df below is a small example of the problem:
| String | Info A | Info B |
|-------------------------|------------------|------------------|
| 'some text' | 50 | 60 |
| 'A=70, B=80, text' | | |
| 'text, A = 10m, B:20' | | |
The actual version of the df has 10 variables that I need to move and about 2 million rows.
What I need to do is put this information in the right fields, as shown in the first row of the example.
I tried a few things, but they all either had errors or would take too much time. If someone could help me think of a solution, I would really appreciate it.

You can use a regex with str.extractall to get the variable names and values, then pivot and update:
variables = ['A', 'B']
regex = fr'\b({"|".join(variables)})\s*[=:]\s*(\d+)'
# '\\b(A|B)\\s*[=:]\\s*(\\d+)'
df.update(df['String']
          .str.extractall(regex)
          .reset_index(level=0)
          .pivot(index='level_0', columns=0, values=1)
          .add_prefix('Info ')
          )
output:
String Info A Info B
0 some text 50.0 60.0
1 A=70, B=80, text 70 80
2 text, A = 10m, B:20 10 20
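Note that update writes the extracted values back as strings (hence the mix of 50.0 and 70 in the output above). A small, hedged follow-up sketch if you want consistent numeric dtypes afterwards:
# convert both columns to numeric in one go
df[['Info A', 'Info B']] = df[['Info A', 'Info B']].apply(pd.to_numeric)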

Here is some simple code that can help you. For your example, you can do:
import pandas as pd

df = pd.DataFrame(columns=["String", "Info A", "Info B"])
df["String"] = ['some text', 'A=70, B=80, text', 'text, A = 10m, B:20']
df["Info A"] = [50, None, None]
df["Info B"] = [60, None, None]

list_strings = list(df["String"])
rows = []  # collect one dict per input string, then build the DataFrame once at the end
for str_ in list_strings:
    # start every row with all columns set to None
    dict_dummy = dict(zip(list(df.columns), [None] * len(df.columns)))
    split_str = str_.split(sep=",")
    for splited_elm in split_str:
        if "A" in splited_elm and ("=" in splited_elm or ":" in splited_elm):
            dict_dummy["Info A"] = splited_elm
        elif "B" in splited_elm and ("=" in splited_elm or ":" in splited_elm):
            dict_dummy["Info B"] = splited_elm
        else:
            dict_dummy["String"] = splited_elm
    rows.append(dict_dummy)
# DataFrame.append is deprecated/removed in recent pandas, so build from the list of dicts instead
new_df = pd.DataFrame(rows, columns=df.columns)
Output :
new_df
Out[47]:
String Info A Info B
0 some text None None
1 text A=70 B=80
2 text A = 10m B:20
This little script helps you classify your elements. You can do further processing to clean up the df.
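For example, a hedged follow-up sketch that keeps only the numeric part of each extracted field (assuming the values are always integers, as in the example):
for col in ['Info A', 'Info B']:
    new_df[col] = new_df[col].str.extract(r'(\d+)', expand=False)  # ' B=80' -> '80'; missing values stay NaN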


Importing csv formatted for excel into a dataframe

I am receiving datafiles from 2 different people and the files are coming through with different formats despite both users using the same system and the same browser.
I would like to be able to make my code smart enough to read either format but so far I have been unsuccessful.
The data that I am having issues with looks like this
+----------------+---------------+--------------+
| Customer Name | Customer code | File Ref |
+----------------+---------------+--------------+
| ACCOUNT SET UP | ="35" | R2I0025715 |
+----------------+---------------+--------------+
| Xenox | ="4298" | ="913500999" |
+----------------+---------------+--------------+
and the data that is importing cleanly looks like this
+----------------+---------------+------------+
| Customer Name | Customer code | File Ref |
+----------------+---------------+------------+
| ACCOUNT SET UP | 35 | R2I0025715 |
+----------------+---------------+------------+
| Xenox | 4298 | 913500999 |
+----------------+---------------+------------+
I am trying to import the data with the following code pd.read_csv(f, encoding='utf-8', dtype={"Customer Name": "string", "Customer code": "string", "File Ref": "string"})
A workaround that I am using is opening each csv in excel, and saving. But when this involves hundreds of files, it isn't really a workaround.
Can anyone help?
You could use the standard strip() function to remove leading and trailing = and " characters on all of your columns.
For example:
import pandas as pd
data = {
'Customer Name' : ['ACCOUNT SET UP', 'Xenox', 'ACCOUNT SET UP', 'Xenox'],
'Customer Code': ['="35"', '="4298"', '35', '4298'],
'File Ref': ['R2I0025715', '="913500999"', 'R2I0025715', '913500999']
}
df = pd.DataFrame(data)
for col in df.columns:
    df[col] = df[col].str.strip('="')
print(df)
Giving you:
Customer Name Customer Code File Ref
0 ACCOUNT SET UP 35 R2I0025715
1 Xenox 4298 913500999
2 ACCOUNT SET UP 35 R2I0025715
3 Xenox 4298 913500999
If you just want to apply it to specific columns, use:
for col in ['Customer Code', 'File Ref']:
    df[col] = df[col].str.strip('="')
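If you would rather clean the values while reading, here is a hedged sketch using read_csv's converters option (the column names are assumptions based on your example; adjust them to your files):
import pandas as pd

def strip_excel(value):
    # converters receive each field as a string, before any dtype handling
    return value.strip('="')

df = pd.read_csv(f, encoding='utf-8',
                 converters={'Customer code': strip_excel, 'File Ref': strip_excel},
                 dtype={'Customer Name': 'string'})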
My Solution:
import re
import pandas as pd

def removechar(x):
    x = str(x)
    out = re.sub('="', '', x)
    return out

def removechar2(x):
    x = str(x)
    out = re.sub('"', '', x)
    out = int(out)  # could use float(), depends on what you want
    return out

# then use applymap from pandas
Example:
datas = {'feature1': ['="23"', '="24"', '="23"', '="83"'], 'feature2': ['="23"', '="2"', '="3"', '="23"']}
test = pd.DataFrame(datas) # Example dataframe
test
Out[1]:
feature1 feature2
0 ="23" ="23"
1 ="24" ="2"
2 ="23" ="3"
3 ="83" ="23"
#applymap my functions
test = test.applymap(removechar)
test = test.applymap(removechar2)
test
Out[2]:
feature1 feature2
0 23 23
1 24 2
2 23 3
3 83 23
#fixed
Note: you could probably do it with just one call to applymap and a single function running re.sub; try reading the documentation for re.sub. This was something quick I whipped up.
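For instance, a hedged one-liner along those lines (one regex that drops both the = and the quotes, then converts to int):
test = test.applymap(lambda x: int(re.sub('[="]', '', str(x))))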

Filter Header 2 rows and Trailer 1 row in 1000 of huge files pyspark

I have many 1000s of huge files in a folder.
Each file has 2 header rows and a trailer row.
file1
H|*|F|*|TYPE|*|EXTRACT|*|Stage_|*|2021.04.18 07:35:26|##|
H|*|TYP_ID|*|TYP_DESC|*|UPD_USR|*|UPD_TSTMP|##|
E|*||*|CONNECTOR|*|2012.06.01 09:03:11|##|
H|*|Tracking|*|asdasasd|*|2011.03.04 11:50:51|##|
S|*|Tracking|*|asdasdas|*|2011.03.04 11:51:06|##|
T|*|3|*|2021.04.18 07:35:43|##|
file2
H|*|F|*|PA__STAT|*|EXTRACT|*|Folder|*|2021.04.18 07:35:26|##|
H|*|STAT_ID|*|STAT_DESC|*|UPD_USR|*|UPD_TSTMP|##|
A|*|Active / Actif|*|1604872|*|2018.06.25 15:12:35|##|
D|*||*|CONNECTOR|*|2012.04.06 10:49:09|##|
I|*|Intermittent Leave|*|asdasda|*|2021.04.09 13:14:00|##|
L|*|On Leave|*|asdasasd|*|2011.03.04 11:49:40|##|
P|*|Paid Leave|*|asdasd|*|2011.03.04 11:49:56|##|
T|*|Terminated / Terminé|*|1604872|*|2018.06.25 15:13:06|##|
U|*||*|CONNECTOR|*|2012.06.16 09:04:14|##|
T|*|7|*|2021.04.18 07:35:55|##|
file3
H|*|K|*|PA_CPN|*|EXTRACT|*|SuccessFactors|*|2021.04.22 23:09:26|##|
H|*|COL_NUM|*|CPNT_TYP_ID|*|CPNT_ID|*|REV_DTE|##|
40|*|OLL|*|asdasdas|*|2019.01.21 14:07:00|##|
40|*|OLL|*|asdasda|*|2019.01.21 14:18:00|##|
40|*|OLL|*|asdasdas|*|2019.01.21 14:20:00|##|
T|*|3|*|2021.04.22 23:27:17|##|
I am applying a filter on lines starting with H|*| and T|*|, but it is rejecting the data for a few rows.
df_cleanse=spark.sql("select replace(replace(replace(value,'~','-'),'|*|','~'),'|##|','') as value from linenumber3 where value not like 'T|*|%' and value not like 'H|*|%'")
I know we can use zipWithIndex, but then I have to read file by file, apply the zip index, and then filter on the rows.
for each file:
    df = spark.read.text('file1')
    # Add an index column: each row gets its row number. Spark distributes the data,
    # so to maintain the order of the rows we need to perform this action.
    df_1 = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
    df_1.createOrReplaceTempView("linenumber")
    spark.sql("select * from linenumber where index > 1 and value.value not like 'T|*|%'")
Please let me know the optimal solution for this. I do not want to run an extensive program; all I need is to just remove 3 lines per file. Even a regex to remove the rows is fine, and we need to process TBs of files in this format.
Unix commands and sed operators are ruled out due to the file sizes.
In the meantime, try this to remove the first two lines and the last:
from pyspark.sql.window import Window
import pyspark.sql.functions as f
df = spark.read.csv('your_path', schema='value string')
df = df.withColumn('filename', f.input_file_name())
df = df.repartition('filename')
df = df.withColumn('index', f.monotonically_increasing_id())
w = Window.partitionBy('filename')
df = (df
      .withColumn('remove', (f.col('index') == f.max('index').over(w)) | (f.col('index') < f.min('index').over(w) + f.lit(2)))
      .where(~f.col('remove'))
      .select('value'))
df.show(truncate=False)
Output
+-------------------------------------------------------------+
|value |
+-------------------------------------------------------------+
|E|*||*|CONNECTOR|*|2012.06.01 09:03:11|##| |
|H|*|Tracking|*|asdasasd|*|2011.03.04 11:50:51|##| |
|S|*|Tracking|*|asdasdas|*|2011.03.04 11:51:06|##| |
|A|*|Active / Actif|*|1604872|*|2018.06.25 15:12:35|##| |
|D|*||*|CONNECTOR|*|2012.04.06 10:49:09|##| |
|I|*|Intermittent Leave|*|asdasda|*|2021.04.09 13:14:00|##| |
|L|*|On Leave|*|asdasasd|*|2011.03.04 11:49:40|##| |
|P|*|Paid Leave|*|asdasd|*|2011.03.04 11:49:56|##| |
|T|*|Terminated / Terminé|*|1604872|*|2018.06.25 15:13:06|##||
|U|*||*|CONNECTOR|*|2012.06.16 09:04:14|##| |
|40|*|OLL|*|asdasdas|*|2019.01.21 14:07:00|##| |
|40|*|OLL|*|asdasda|*|2019.01.21 14:18:00|##| |
|40|*|OLL|*|asdasdas|*|2019.01.21 14:20:00|##| |
+-------------------------------------------------------------+
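If you also want the delimiter cleansing from the question's SQL, a hedged follow-up sketch (reusing the f alias imported above):
df_cleanse = (df
              .withColumn('value', f.regexp_replace('value', '~', '-'))
              .withColumn('value', f.regexp_replace('value', r'\|\*\|', '~'))
              .withColumn('value', f.regexp_replace('value', r'\|##\|', '')))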

How to find the correlation between columns starting with Open_ and Close_?

I'm trying to find the correlation between the open and close prices of 150 cryptocurrencies using pandas.
Each cryptocurrency data is stored in its own CSV file and looks something like this:
|---------------------|------------------|------------------|
| Date | Open | Close |
|---------------------|------------------|------------------|
| 2019-02-01 00:00:00 | 0.00001115 | 0.00001119 |
|---------------------|------------------|------------------|
| 2019-02-01 00:05:00 | 0.00001116 | 0.00001119 |
|---------------------|------------------|------------------|
| . | . | . |
I would like to find the correlation between the Close and Open column of every cryptocurrency.
As of right now, my code looks like this:
temporary_dataframe = pandas.DataFrame()
for csv_path, coin in zip(all_csv_paths, coin_name):
    data_file = pandas.read_csv(csv_path)
    temporary_dataframe[f"Open_{coin}"] = data_file["Open"]
    temporary_dataframe[f"Close_{coin}"] = data_file["Close"]
# Create all_open based on temporary_dataframe data.
corr_file = all_open.corr()
print(corr_file.unstack().sort_values().drop_duplicates())
Here is a part of the output (the output has a shape of (43661,)):
Open_QKC_BTC Close_QKC_BTC 0.996229
Open_TNT_BTC Close_TNT_BTC 0.996312
Open_ETC_BTC Close_ETC_BTC 0.996423
The problem is that I don't want to see the following correlations:
between columns starting with Close_ and Close_(e.g. Close_USD_BTC and Close_ETH_BTC)
between columns starting with Open_ and Open_ (e.g. Open_USD_BTC and Open_ETH_BTC)
between the same coin (e.g. Open_USD_BTC and Close_USD_BTC).
In short, the perfect output would look like this:
Open_TNT_BTC Close_QKC_BTC 0.996229
Open_ETH_BTC Close_TNT_BTC 0.996312
Open_ADA_BTC Close_ETC_BTC 0.996423
(PS: I'm pretty sure this is not the most elegant way to do what I'm doing. If anyone has any suggestions on how to make this script better, I would be more than happy to hear them.)
Thank you very much in advance for your help!
This is quite messy, but it at least shows you an option.
Here I am generating some random data and have made the suffixes (coin names) simpler than in your case:
import string
import numpy as np
import pandas as pd
#Generate random data
prefix = ['Open_','Close_']
suffix = string.ascii_uppercase #All uppercase letter to simulate coin-names
var1 = [None] * 100
var2 = [None] * 100
for i in range(len(var1)):
    var1[i] = prefix[np.random.randint(0,len(prefix))] + suffix[np.random.randint(0,len(suffix))]
    var2[i] = prefix[np.random.randint(0,len(prefix))] + suffix[np.random.randint(0,len(suffix))]
df = pd.DataFrame(data = {'var1': var1, 'var2':var2 })
df['DropScenario_1'] = False
df['DropScenario_2'] = False
df['DropScenario_3'] = False
df['DropScenario_Final'] = False
df['DropScenario_1'] = df.apply(lambda row: bool(prefix[0] in row.var1) and (prefix[0] in row.var2), axis=1) #Both are Open_
df['DropScenario_2'] = df.apply(lambda row: bool(prefix[1] in row.var1) and (prefix[1] in row.var2), axis=1) #Both are Close_
df['DropScenario_3'] = df.apply(lambda row: bool(row.var1[len(row.var1)-1] == row.var2[len(row.var2)-1]), axis=1) #Both suffixes are the same
#Combine all scenarios
df['DropScenario_Final'] = df['DropScenario_1'] | df['DropScenario_2'] | df['DropScenario_3']
#Keep only the part of the df that we want
df = df[df['DropScenario_Final'] == False]
#Drop our messy columns
df = df.drop(['DropScenario_1','DropScenario_2','DropScenario_3','DropScenario_Final'], axis = 1)
Hope this helps
P.S If you find the secret key to trading bitcoins without ending up on r/wallstreetbets, ill take 5% ;)
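Coming back to the correlation series from the question, here is a hedged sketch that filters the unstacked pairs directly (assuming corr_file is the correlation matrix you already computed):
pairs = corr_file.unstack()
# keep only Open_ vs Close_ pairs where the coin names differ
keep = [a.startswith('Open_') and b.startswith('Close_')
        and a[len('Open_'):] != b[len('Close_'):]
        for a, b in pairs.index]
print(pairs[keep].sort_values())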

How to generate table using Python

I am quite struggling with this, as I have tried many libraries to print a table but with no success, so I thought to post here and ask.
My data is in a text file (resource.txt) which looks like this (it prints exactly this way):
pipelined 8 8 0 17 0 0
nonpipelined 2 2 0 10 0 0
I want my data printed in the following manner:
Design name LUT Lut as m Lut as I FF DSP BRAM
-------------------------------------------------------------------
pipelined 8 8 0 17 0 0
Non piplined 2 2 0 10 0 0
Sometimes there may be more data; the columns remain the same but the number of rows may increase.
(I have Python 2.7.)
I am using this part in my Python code. All of the code works, but I couldn't print the data that I extracted to the text file in tabular form. I can't use the pandas library, as it won't support Python 2.7, but I can use tabulate and similar libraries. Can anyone please help me?
I tried using tabulate but I keep getting errors.
I also tried a simple print at the end, but it's not working (the same code works if I put it at the top of the code, but at the end it won't). Does anyone have any idea?
q11 = open("resource.txt", "r")
for line in q11:
    print(line)
Here's a self contained function that makes a left-justified, technical paper styled table.
def makeTable(headerRow, columnizedData, columnSpacing=2):
    """Creates a technical paper style, left justified table
       Author: Christopher Collett
       Date: 6/1/2019"""
    from numpy import array, max, vectorize
    cols = array(columnizedData, dtype=str)
    colSizes = [max(vectorize(len)(col)) for col in cols]
    header = ''
    rows = ['' for i in cols[0]]
    for i in range(0, len(headerRow)):
        if len(headerRow[i]) > colSizes[i]: colSizes[i] = len(headerRow[i])
        headerRow[i] += ' '*(colSizes[i]-len(headerRow[i]))
        header += headerRow[i]
        if not i == len(headerRow)-1: header += ' '*columnSpacing
        for j in range(0, len(cols[i])):
            if len(cols[i][j]) < colSizes[i]:
                cols[i][j] += ' '*(colSizes[i]-len(cols[i][j])+columnSpacing)
            rows[j] += cols[i][j]
            if not i == len(headerRow)-1: rows[j] += ' '*columnSpacing
    line = '-'*len(header)
    print(line)
    print(header)
    print(line)
    for row in rows: print(row)
    print(line)
And here's an example using this function.
>>> header = ['Name','Age']
>>> names = ['George','Alberta','Frank']
>>> ages = [8,9,11]
>>> makeTable(header,[names,ages])
------------
Name Age
------------
George 8
Alberta 9
Frank 11
------------
Since the number of columns remains the same, you could just print out the first line with as many spaces as required. For example:
print("Design name", ' ', "LUT", ' ', "Lut as m", ' ', "and continue like that")
Then read the csv file. The datafile will be:
import csv

datafile = open('resource.csv', 'r')
reader = csv.reader(datafile)
for col in reader:
    print(col[0], ' ', col[1], ' ', col[2], ' ', "and continue depending on the number of columns")
This is not the optimized solution, but since it looks like you are new, this will help you understand better. Or else you can use row_format print options in Python 2.7.
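For reference, a hedged sketch of that row_format approach (works on Python 2.7; the column widths are assumptions you can tune):
row_format = "{:<15}{:>8}{:>10}{:>10}{:>6}{:>6}{:>6}"
header = ["Design name", "LUT", "Lut as m", "Lut as I", "FF", "DSP", "BRAM"]
print(row_format.format(*header))
print("-" * 61)
with open("resource.txt") as f:
    for line in f:
        # each line of resource.txt is whitespace-separated, as in the question
        print(row_format.format(*line.split()))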
Here is code to print a nice table using beautifultable. You transfer your data into lists (one list per row), or read each line of your text file into a list, and then print it:
from beautifultable import BeautifulTable
h0=["jkgjkg"]
h1=[2,3]
h2=[2,3]
h3=[2,3]
h4=[2,3]
h5=[2,3]
h0.append("FPGA resources")
table = BeautifulTable()
table.column_headers = h0
table.append_row(h1)
table.append_row(h2)
table.append_row(h3)
table.append_row(h4)
table.append_row(h5)
print(table)
Out Put:
+--------+----------------+
| jkgjkg | FPGA resources |
+--------+----------------+
| 2 | 3 |
+--------+----------------+
| 2 | 3 |
+--------+----------------+
| 2 | 3 |
+--------+----------------+
| 2 | 3 |
+--------+----------------+
| 2 | 3 |
+--------+----------------+

Using regex to extract information from a large SFrame or dataframe without using a loop

I have the following code in which I use a loop to extract some information and use these information to create a new matrix. However, because I am using a loop, this code takes forever to finish.
I wonder if there is a better way of doing this by using GraphLab's SFrame or pandas dataframe. I appreciate any help!
# This is the regex pattern
pattern_topic_entry_read = r"\d{15}/discussion_topics/(?P<topic>\d{9})/entries/(?P<entry>\d{9})/read"
# Using the pattern, I filter my records
requests_topic_entry_read = requests[requests['url'].apply(lambda x: False if regex.match(pattern_topic_entry_read, x) == None else True)]
# Then for each record in the final set,
# I need to extract topic and entry info using match.group
for request in requests_topic_entry_read:
    for match in regex.finditer(pattern_topic_entry_read, request['url']):
        topic, entry = match.group('topic'), match.group('entry')
        # Then, I need to create a new SFrame (or dataframe, or anything suitable)
        newRow = gl.SFrame({'user_id': [request['user_id']],
                            'url': [request['url']],
                            'topic': [topic], 'entry': [entry]})
        # And, append it to my existing SFrame (or dataframe)
        entry_read_matrix = entry_read_matrix.append(newRow)
Some sample data:
user_id | url
1000 | /123456832960900/discussion_topics/770000832912345/read
1001 | /123456832960900/discussion_topics/770000832923456/view?per_page=832945307
1002 | /123456832960900/discussion_topics/770000834562343/entries/832350330/read
1003 | /123456832960900/discussion_topics/770000534344444/entries/832350367/read
I want to obtain this:
user_id | topic | entry
1002 | 770000834562343 | 832350330
1003 | 770000534344444 | 832350367
Pandas' series has string functions for that. E.g., with your data in df:
import io
import re
import pandas as pd

# `data` holds the sample table from the question as a string
pattern = re.compile(r'.*/discussion_topics/(?P<topic>\d+)(?:/entries/(?P<entry>\d+))?')
df = pd.read_table(io.StringIO(data), sep=r'\s*\|\s*', index_col='user_id', engine='python')
df.url.str.extract(pattern, expand=True)
yields
topic entry
user_id
1000 770000832912345 NaN
1001 770000832923456 NaN
1002 770000834562343 832350330
1003 770000534344444 832350367
Here, let me reproduce it:
>>> import pandas as pd
>>> df = pd.DataFrame(columns=["user_id","url"])
>>> df.user_id = [1000,1001,1002,1003]
>>> df.url = ['/123456832960900/discussion_topics/770000832912345/read', '/123456832960900/discussion_topics/770000832923456/view?per_page=832945307', '/123456832960900/discussion_topics/770000834562343/entries/832350330/read','/123456832960900/discussion_topics/770000534344444/entries/832350367/read']
>>> df["entry"] = df.url.apply(lambda x: x.split("/")[-2] if "entries" in x.split("/") else "---")
>>> df["topic"] = df.url.apply(lambda x: x.split("/")[-4] if "entries" in x.split("/") else "---")
>>> df[df.entry!="---"]
gives you the desired DataFrame.
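Alternatively, a hedged sketch that applies the named groups vectorized on the original frame, with no loop at all (assuming requests is a pandas DataFrame with user_id and url columns):
extracted = requests['url'].str.extract(r'/discussion_topics/(?P<topic>\d+)/entries/(?P<entry>\d+)/read')
entry_read_matrix = (requests[['user_id']]
                     .join(extracted)                     # align by row index
                     .dropna(subset=['topic', 'entry']))  # drop urls without an entry/read part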
