I want to extract the characters and numbers immediately after the specific string "data-permalink=" in a huge text file (50 MB). The output should ideally be written to a simple (separate) text file, looking something like this:
34k89
456ij
233a4
...
the "data-permalink="" stays always the exact same (as usual in source codes), but the id within can be any combination of characters and numbers. It seemed simple at first, but since it is not at the start of a line, or the needed output is not a separate word I was not able to come up with a working solution at all in the required time. I am running out of time and need a solution or hints to this immediately, so any help is greatly appreciated
example of data in the source data file:
random stuff above
....
I understand C++ or Python the best, so a solution using one of these languages would be nice.
I tried something like this:
#include <iostream>
#include <string>
#include <fstream>
using namespace std;
int main()
{
    ifstream in ("data.txt");
    if(in.fail())
    {
        cout<<"error";
    }
    else
    {
        char c;
        while(in.get(c))
        {
            if(c=="data-permalink=")
                cout<<"lol this is awesome"
            else
                cout<<" ";
        }
    }
    return 0;
}
It is just a random attempt to see if the structure works, nowhere near a solution. This probably also gives you a good idea of how inexperienced I currently am.
Hm, basically 50 MB is considered "small" nowadays. With that little data, you can read the whole file into one std::string and then do a linear search.
So, the algorithm is:
1. Open the files and check whether they could be opened
2. Read the complete file into a std::string
3. Do a linear search for the string data-permalink="
4. Remember the start position of the permalink
5. Search for the closing "
6. Use std::string's substr function to create the output permalink string
7. Write this to the output file
8. Repeat from step 3 until no further match is found (see the condensed sketch below)
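In code, the core of that search loop looks roughly like this (a condensed sketch, assuming the whole file is already in a std::string named data and ofs is the opened output file stream; the complete program follows below):
// Steps 3-8: repeatedly find the attribute, then the closing quote
const std::string key{ "data-permalink=\"" };
for (size_t pos{}; (pos = data.find(key, pos)) != std::string::npos; ) {
    const size_t start = pos + key.length();
    const size_t end = data.find('"', start);   // closing quote
    if (end == std::string::npos) break;
    ofs << data.substr(start, end - start) << '\n';
    pos = end + 1;
}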
I created a 70 MB test file with random data.
The whole procedure takes less than 1 s, even with the slow linear search.
But one caveat: you want to parse an HTML file. This will most probably not work in general, because of potentially nested structures. For that you should use an existing HTML parser.
Anyway. Here is one of many possible solutions.
#include <iostream>
#include <fstream>
#include <string>
#include <random>
#include <iterator>
#include <algorithm>
std::string randomSourceCharacters{ " abcdefghijklmnopqrstuvwxyz" };
const std::string sourceFileName{ "r:\\test.txt" };
const std::string linkFileName{ "r:\\links.txt" };
void createRandomData() {
std::random_device randomDevice;
std::mt19937 randomGgenerator(randomDevice());
std::uniform_int_distribution<> randomCharacterDistribution(0, randomSourceCharacters.size() - 1);
std::uniform_int_distribution<> randomLength(10, 30);
if (std::ofstream ofs{ sourceFileName }; ofs) {
for (size_t i{}; i < 1000000; ++i) {
const int prefixLength{ randomLength(randomGgenerator) };
const int linkLength{ randomLength(randomGgenerator) };
const int suffixLength{ randomLength(randomGgenerator) };
for (int k{}; k < prefixLength; ++k)
ofs << randomSourceCharacters[randomCharacterDistribution(randomGgenerator)];
ofs << "data-permalink=\"";
for (int k{}; k < linkLength; ++k)
ofs << randomSourceCharacters[randomCharacterDistribution(randomGgenerator)];
ofs << "\"";
for (int k{}; k < suffixLength; ++k)
ofs << randomSourceCharacters[randomCharacterDistribution(randomGgenerator)];
}
}
else std::cerr << "\nError: Could not open source file '" << sourceFileName << "' for writing\n";
}
int main() {
// Please uncomment if you want to create a file with test data
// createRandomData();
// Open source file for reading and check, if file could be opened
if (std::ifstream ifs{ sourceFileName }; ifs) {
// Open link file for writing and check, if file could be opened
if (std::ofstream ofs{ linkFileName }; ofs) {
// Read the complete 50MB file into a string
std::string data(std::istreambuf_iterator<char>(ifs), {});
const std::string searchString{ "data-permalink=\"" };
const std::string permalinkEndString{ "\"" };
// Do a linear search
for (size_t posBegin{}; posBegin < data.length(); ) {
// Search for the beginning of the permalink
if (posBegin = data.find(searchString, posBegin); posBegin != std::string::npos) {
const size_t posStartForEndSearch = posBegin + searchString.length();
// Search for the end of the permalink
if (size_t posEnd = data.find(permalinkEndString, posStartForEndSearch); posEnd != std::string::npos) {
// Output result
const size_t lengthPermalink{ posEnd - posStartForEndSearch };
const std::string output{ data.substr(posStartForEndSearch, lengthPermalink) };
ofs << output << '\n';
posBegin = posEnd + 1;
}
else break;
}
else break;
}
}
else std::cerr << "\nError: Could not open link file '" << linkFileName << "' for writing\n";
}
else std::cerr << "\nError: Could not open source file '" << sourceFileName << "' for reading\n";
}
Edit
If you need unique links, you may store the results in a std::unordered_set and output them later.
#include <iostream>
#include <fstream>
#include <string>
#include <iterator>
#include <algorithm>
#include <unordered_set>
const std::string sourceFileName{ "r:\\test.txt" };
const std::string linkFileName{ "r:\\links.txt" };
int main() {
// Open source file for reading and check, if file could be opened
if (std::ifstream ifs{ sourceFileName }; ifs) {
// Open link file for writing and check, if file could be opened
if (std::ofstream ofs{ linkFileName }; ofs) {
// Read the complete 50MB file into a string
std::string data(std::istreambuf_iterator<char>(ifs), {});
const std::string searchString{ "data-permalink=\"" };
const std::string permalinkEndString{ "\"" };
// Here we will store unique results
std::unordered_set<std::string> result{};
// Do a linear search
for (size_t posBegin{}; posBegin < data.length(); ) {
// Search for the beginning of the permalink
if (posBegin = data.find(searchString, posBegin); posBegin != std::string::npos) {
const size_t posStartForEndSearch = posBegin + searchString.length();
// Search for the end of the permalink
if (size_t posEnd = data.find(permalinkEndString, posStartForEndSearch); posEnd != std::string::npos) {
// Store result
const size_t lengthPermalink{ posEnd - posStartForEndSearch };
const std::string output{ data.substr(posStartForEndSearch, lengthPermalink) };
result.insert(output);
posBegin = posEnd + 1;
}
else break;
}
else break;
}
for (const std::string& link : result)
ofs << link << '\n';
}
else std::cerr << "\nError: Could not open link file '" << linkFileName << "' for writing\n";
}
else std::cerr << "\nError: Could not open source file '" << sourceFileName << "' for reading\n";
}
I am trying to read an hdf5 file containing variable-length vectors of doubles in C++. I used the following code to create the hdf5 file. It contains one dataset called "test" containing 100 rows of varying lengths. I had to make a couple of changes to the code in the link, so for convenience here is the exact code I used to write the data to hdf5:
#include <iostream>
#include <string>
#include <H5Cpp.h>
#include <vector>
#include <random>
const hsize_t n_dims = 1;
const hsize_t n_rows = 100;
const std::string dataset_name = "test";
int main () {
H5::H5File file("vlen_cpp.hdf5", H5F_ACC_TRUNC);
H5::DataSpace dataspace(n_dims, &n_rows);
// target dtype for the file
auto item_type = H5::PredType::NATIVE_DOUBLE;
auto file_type = H5::VarLenType(&item_type);
// dtype of the generated data
auto mem_type = H5::VarLenType(&item_type);
H5::DataSet dataset = file.createDataSet(dataset_name, file_type, dataspace);
std::vector<std::vector<double>> data;
data.reserve(n_rows);
// this structure stores length of each varlen row and a pointer to
// the actual data
std::vector<hvl_t> varlen_spec(n_rows);
std::mt19937 gen;
std::normal_distribution<double> normal(0.0, 1.0);
std::poisson_distribution<hsize_t> poisson(20);
for (hsize_t idx=0; idx < n_rows; idx++) {
data.emplace_back();
hsize_t size = poisson(gen);
data.at(idx).reserve(size);
varlen_spec.at(idx).len = size;
varlen_spec.at(idx).p = (void*) &data.at(idx).front();
for (hsize_t i = 0; i < size; i++) {
data.at(idx).push_back(normal(gen));
}
}
dataset.write(&varlen_spec.front(), mem_type);
return 0;
}
I am very new to C++ and my issue is trying to read the data back out of this file in C++. I tried to mimic what I would do in Python, but didn't have any luck. In Python, I would do this:
import h5py
import numpy as np
data = h5py.File("vlen_cpp.hdf5", "r")
i = 0 # This is the row I would want to read
arr = data["test"][i] # <-- This is the simplest way.
# Now trying to mimic something closer to C++
did = data["test"].id
dataspace = did.get_space()
dataspace.select_hyperslab(start=(i, ), count=(1, ))
memspace = h5py.h5s.create_simple(dims_tpl=(1, ))
memspace.select_hyperslab(start=(0, ), count=(1, ))
arr = np.zeros((1, ), dtype=object)
did.read(memspace, dataspace, arr)
print(arr) # This gives back the correct data
The Python code seems to work fine, so I tried to mimic those steps in C++:
#include <H5Cpp.h>
#include <string>
#include <vector>
#include <stdio.h>
int main(int argc, char **argv) {
std::string filename = argv[1];
// memtype of the file
auto itemType = H5::PredType::NATIVE_DOUBLE;
auto memType = H5::VarLenType(&itemType);
// get dataspace
H5::H5File file(filename, H5F_ACC_RDONLY);
H5::DataSet dataset = file.openDataSet("test");
H5::DataSpace dataspace = dataset.getSpace();
// get the size of the dataset
hsize_t rank;
hsize_t dims[1];
rank = dataspace.getSimpleExtentDims(dims); // rank = 1
std::cout << "Data size: "<< dims[0] << std::endl; // this is the correct number of values
// create memspace
hsize_t memDims[1] = {1};
H5::DataSpace memspace(rank, memDims);
// container to store read data
std::vector<std::vector<double>> data;
// Select hyperslabs
hsize_t dataCount[1] = {1};
hsize_t dataOffset[1] = {0}; // this should be i
hsize_t memCount[1] = {1};
hsize_t memOffset[1] = {0};
dataspace.selectHyperslab(H5S_SELECT_SET, dataCount, dataOffset);
memspace.selectHyperslab(H5S_SELECT_SET, memCount, memOffset);
// vector to store read data
std::vector<double> temp;
temp.reserve(20);
dataset.read(temp.data(), memType, memspace, dataspace);
for (int i = 0; i < temp.size(); i++) {
std::cout << temp[i] << ", ";
}
std::cout << "\n";
return 0;
}
Nothing crashes when I run the C++ program, and the correct number of rows in the "test" dataset is printed (100), but the dataset.read() step isn't working: the first row isn't being read into the vector I want it to be read into (temp). I would greatly appreciate if someone could let me know what I'm doing wrong. Thanks so much.
My goal is to eventually read all 100 rows of the dataset in a loop (placing each row of data into the std::vector<double> temp) and store each one in the std::vector<std::vector<double>> called data. But for now I'm just trying to make sure I can even read the first row.
EDIT: link to hdf5 file
"test" dataset looks like this:
[ 0.16371168 -0.21425339 0.29859526 -0.82794418 0.01021543 1.05546644
-0.546841 1.17456768 0.66068215 -1.04944273 1.48596426 -0.62527598
-2.55912244 -0.82908105 -0.53978052 -0.88870719]
[ 0.33958656 -0.48258915 2.10885699 -0.12130623 -0.2873894 -0.37100313
-1.05934898 -2.3014427 1.45502412 -0.06152739 0.92532768 1.35432642
1.51560926 -0.24327452 1.00886476 0.19749707 0.43894484 0.4394992
-0.12814881]
[ 0.64574273 0.14938582 -0.10369248 1.53727461 0.62404949 1.07824824
1.17066933 1.17196281 -2.05005927 0.13639514 -1.45473056 -1.71462623
-1.11552074 -1.73985207 1.12422121 -1.58694009]
...
EDIT 2:
I've additionally tried, without any luck, to read the data into an array, an Armadillo vector, and an Eigen VectorXd. The program does not crash, but what is read into the containers is garbage:
#include <H5Cpp.h>
#include <string>
#include <vector>
#include <stdio.h>
#include <Eigen/Dense>
#include <Eigen/Core>
#include <armadillo>
int main(int argc, char **argv) {
std::string filename = argv[1];
// memtype of the file
auto itemType = H5::PredType::NATIVE_DOUBLE;
auto memType = H5::VarLenType(&itemType);
// get dataspace
H5::H5File file(filename, H5F_ACC_RDONLY);
H5::DataSet dataset = file.openDataSet("test");
H5::DataSpace dataspace = dataset.getSpace();
// get the size of the dataset
hsize_t rank;
hsize_t dims[1];
rank = dataspace.getSimpleExtentDims(dims); // rank = 1
std::cout << "Data size: "<< dims[0] << std::endl; // this is the correct number of values
std::cout << "Data rank: "<< rank << std::endl; // this is the correct rank
// create memspace
hsize_t memDims[1] = {1};
H5::DataSpace memspace(rank, memDims);
// Select hyperslabs
hsize_t dataCount[1] = {1};
hsize_t dataOffset[1] = {0}; // this would be i if reading in a loop
hsize_t memCount[1] = {1};
hsize_t memOffset[1] = {0};
dataspace.selectHyperslab(H5S_SELECT_SET, dataCount, dataOffset);
memspace.selectHyperslab(H5S_SELECT_SET, memCount, memOffset);
// Create storage to hold read data
int i;
int NX = 20;
double data_out[NX];
for (i = 0; i < NX; i++)
data_out[i] = 0;
arma::vec temp(20);
Eigen::VectorXd temp2(20);
// Read data into data_out (array)
dataset.read(data_out, memType, memspace, dataspace);
std::cout << "data_out: " << "\n";
for (i = 0; i < NX; i++)
std::cout << data_out[i] << " ";
std::cout << std::endl;
// Read data into temp (arma vec)
dataset.read(temp.memptr(), memType, memspace, dataspace);
std::cout << "arma vec: " << "\n";
std::cout << temp << std::endl;
// Read data into temp (eigen vec)
dataset.read(temp2.data(), memType, memspace, dataspace);
std::cout << "eigen vec: " << "\n";
std::cout << temp2 << std::endl;
return 0;
}
(ONE) SOLUTION:
After struggling with this a lot, I was able to get a solution working, though admittedly I'm too new to C++ to really understand why this works and the previous attempts didn't:
#include <H5Cpp.h>
#include <string>
#include <vector>
#include <stdio.h>
int main(int argc, char **argv) {
std::string filename = argv[1];
// Set memtype of the file
auto itemType = H5::PredType::NATIVE_DOUBLE;
auto memType = H5::VarLenType(&itemType);
// Get dataspace
H5::H5File file(filename, H5F_ACC_RDONLY);
H5::DataSet dataset = file.openDataSet("test");
H5::DataSpace dataspace = dataset.getSpace();
// Get the size of the dataset
hsize_t rank;
hsize_t dims[1];
rank = dataspace.getSimpleExtentDims(dims); // rank = 1
std::cout << "Data size: "<< dims[0] << std::endl; // this is the correct number of values
std::cout << "Data rank: "<< rank << std::endl; // this is the correct rank
// Create memspace
hsize_t memDims[1] = {1};
H5::DataSpace memspace(rank, memDims);
// Initialize hyperslabs
hsize_t dataCount[1];
hsize_t dataOffset[1];
hsize_t memCount[1];
hsize_t memOffset[1];
// Create storage to hold read data
hvl_t *rdata = new hvl_t[1];
std::vector<std::vector<double>> dataOut;
for (hsize_t i = 0; i < dims[0]; i++) {
// Select hyperslabs
dataCount[0] = 1;
dataOffset[0] = i;
memCount[0] = 1;
memOffset[0] = 0;
dataspace.selectHyperslab(H5S_SELECT_SET, dataCount, dataOffset);
memspace.selectHyperslab(H5S_SELECT_SET, memCount, memOffset);
// Read out the data
dataset.read(rdata, memType, memspace, dataspace);
double* ptr = (double*)rdata[0].p;
std::vector<double> thisRow;
for (int j = 0; j < rdata[0].len; j++) {
double* val = (double*)&ptr[j];
thisRow.push_back(*val);
}
dataOut.push_back(thisRow);
}
// Confirm data read out properly
for (int i = 0; i < dataOut.size(); i++) {
std::cout << "Row " << i << ":\n";
for (int j = 0; j < dataOut[i].size(); j++) {
std::cout << dataOut[i][j] << " ";
}
std::cout << "\n";
}
return 0;
}
If anyone knows a more efficient way that doesn't involve looping over the elements of each row (i.e. pulling out an entire row in one go), that would be really helpful, but for now this works fine for me.
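One small simplification (just a sketch based on the loop above, not benchmarked): since rdata[0].p points at rdata[0].len contiguous doubles after the read, the row can be copied in one go instead of element by element:
// Copy the whole row at once from the hvl_t buffer filled by dataset.read()
double* ptr = static_cast<double*>(rdata[0].p);
dataOut.emplace_back(ptr, ptr + rdata[0].len);
Depending on the HDF5 version, the buffers the library allocates for variable-length data may also need to be reclaimed afterwards (see H5Dvlen_reclaim in the C API) to avoid leaking memory.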
This is the Python code to execute external Python files:
exec(open("file.py").read())
How can I do it in C?
main.cpp
#include <iostream>
#include <string>
#include <list>
#include <cstdio>  // FILE, popen, pclose, fgets (popen and pclose are POSIX)
class PrivateDriverData {
public:
std::string PythonExecutable = "python3";
std::string exec(std::string command) {
char buffer[128];
std::string result = "";
// Open pipe to file
FILE* pipe = popen(command.c_str(), "r");
if (!pipe) {
return "popen failed!";
}
while (!feof(pipe)) {
if (fgets(buffer, 128, pipe) != NULL)
result += buffer;
}
pclose(pipe);
return result;
}
};
std::string ExecDriver() {
PrivateDriverData LocalPDD;
std::string ScraperExecuteData = LocalPDD.PythonExecutable + " file.py";
return LocalPDD.exec(ScraperExecuteData);
}
int main() {
std::string answer = ExecDriver();
std::cout << answer << "\n";
}
The closest thing C has is dlopen(), which opens a compiled and linked dynamic library and provides a way to run the code it contains.
It's not standard C and requires a hosted environment, so it's not going to work on Arduino etc.
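For illustration, here is a minimal POSIX sketch of how dlopen()/dlsym() are typically used; the library name libplugin.so and the symbol plugin_entry are made-up placeholders, and on Linux you would usually link with -ldl:
// Minimal sketch (POSIX, not standard C/C++): load a shared library at runtime
// and call a function from it. "libplugin.so" and "plugin_entry" are
// hypothetical names used only for illustration.
#include <dlfcn.h>
#include <iostream>

int main() {
    void* handle = dlopen("./libplugin.so", RTLD_NOW);
    if (!handle) { std::cerr << dlerror() << "\n"; return 1; }

    // dlsym returns void*, so cast it to the expected function pointer type
    using entry_fn = int (*)();
    entry_fn entry = reinterpret_cast<entry_fn>(dlsym(handle, "plugin_entry"));
    if (!entry) { std::cerr << dlerror() << "\n"; dlclose(handle); return 1; }

    std::cout << "plugin returned " << entry() << "\n";
    dlclose(handle);
    return 0;
}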
I'm trying to extract data from a .dat file (the data in the file is 16-bit) in C++, but it is showing garbage data. I'm able to extract it in Python (code provided below as well), but my work requires it to be in C++. Here is the C++ code that I'm using.
I would also like to know the fastest way to extract the data, since my files are a bit large.
#include<iostream>
#define N 4000
using namespace std;
struct record {
char details[1500];
};
int main(int argc, char** argv) {
FILE *fp = fopen("mirror.dat","rb");
record *records;
if (fp==NULL){
cout<<"Problem \n";
system("pause");
return -1;
}
records = new record[N];
fread((record *)records, sizeof(record),N,fp );
fclose(fp);
for(int i=0; i<N;i++){
cout<<"[" << i+1 << "]" << records[i].details << "\n";
}
system("PAUSE");
return 0;
}
Below is the Python code.
fpath="mirror.dat"
with open(fpath, 'rb') as r_file:
data=r_file.read()
bits=[data[i+1]<<8 | data[i] for i in range(0, len(data),2)]
print(type(bits))
bits_decod = []
for k in bits:
bits_decod.append(k)
print((bits_decod))
In C++, when you print a char array using <<, it expects it to be a C-style character string.
You need to write a loop that decodes it similarly to the way the Python script does.
#include <iostream>
#include <cstdio>   // FILE, fopen, fread, fclose
#include <cstdint>  // uint8_t, uint16_t
#include <cstdlib>  // system
#define N 4000
using namespace std;
uint8_t data[N * 1500];
uint16_t bits[N * 750];
int main(int argc, char** argv) {
FILE *fp = fopen("mirror.dat","rb");
if (fp==NULL){
cout<<"Problem \n";
system("pause");
return 1;
}
// With an element size of 1, fread returns the number of bytes actually read
size_t data_len = fread((void *)data, 1, sizeof(data), fp);
if (data_len == 0) {
cout << "Read error\n";
system("pause");
return 1;
}
fclose(fp);
for (int i = 0; i < data_len; i+=2) {
bits[i/2] = data[i+1] << 8 | data[i];
}
int bits_len = data_len / 2;
for(int i=0; i<bits_len;i++){
cout<<"[" << i+1 << "]" << bits[i] << "\n";
}
system("PAUSE");
return 0;
}
In C++ you can read the contents of a file into a std::vector of uint8_t with the help of std::istream_iterator. Then loop through the vector, decoding the bytes and putting them into a vector of uint16_t.
std::istream_iterator<uint8_t>(testFile) is an iterator to the beginning of the file, and std::istream_iterator<uint8_t>() is default-constructed with the special state "end-of-stream". This pair of iterators can therefore be used to read from the beginning of the file to the end. We don't have to calculate the size ourselves, so the same code handles files with a varying number of entries. Note that formatted extraction skips whitespace by default, so the stream is switched to std::noskipws below; otherwise whitespace bytes in the binary data would be silently dropped.
The equivalent C++ program will look something like this:
#include <iostream>
#include <cstddef>
#include <vector>
#include <iterator>
#include <algorithm>
#include <fstream>
#include <cstdint>
int main()
{
//Open file
std::ifstream testFile("mirror.dat", std::ios::in | std::ios::binary);
if (!testFile)
{
std::cout << "Problem \n";
system("pause");
return 1;
}
//Read in file contents (std::noskipws so that whitespace bytes are not skipped)
testFile >> std::noskipws;
std::vector<uint8_t> data((std::istream_iterator<uint8_t>(testFile)), std::istream_iterator<uint8_t>());
std::vector<uint16_t> bytes_decoded;
bytes_decoded.reserve(data.size() / 2);
//Decode bytes
for (std::size_t i = 0; i + 1 < data.size(); i += 2)
{
bytes_decoded.push_back(data[i + 1] << 8 | data[i]);
}
//Copy decoded bytes to screen with one space between each number
std::copy(bytes_decoded.cbegin(), bytes_decoded.cend(), std::ostream_iterator<uint16_t>(std::cout, " "));
system("PAUSE");
return 0;
}
Note: This requires C++11 or above for the types uint8_t and uint16_t in the header cstdint. You could use unsigned char and unsigned short instead if you don't have a modern C++ compiler.
I have written a program in C using libcurl to load a URL and send the return value to Python (I am passing 2 integer values from Python to C; I have yet to enhance the code, and am currently just trying out the logic and variable accessibility between Python and C). I am able to compile the program successfully, but when I load the module in Python I get an error saying "undefined symbol: curl_easy_getinfo". Please let me know how to fix the issue.
Code:
#include <Python.h>
#include <stdio.h>
#include <time.h>
#include <stdio.h>
#include <pthread.h>
#include <curl/curl.h>
#define NUMT 4
/*
List of URLs to fetch.
If you intend to use a SSL-based protocol here you MUST setup the OpenSSL
callback functions as described here:
http://www.openssl.org/docs/crypto/threads.html#DESCRIPTION
*/
const char * const urls[NUMT]= {
"http://www.google.com",
"http://www.yahoo.com/",
"http://www.haxx.se/done.html",
"http://www.haxx.se/"
};
#define MINIMAL_PROGRESS_FUNCTIONALITY_INTERVAL 3
struct myprogress {
double lastruntime;
curl_off_t totdnld;
void *url;
CURL *curl;
};
static PyObject *foo1_add(PyObject *self, PyObject *args)
{
int a;
int b;
int s;
if (!PyArg_ParseTuple(args, "ii", &a, &b))
{
return NULL;
}
s = sum (a, b);
return Py_BuildValue("i", s);
// return Py_BuildValue("i", a + b);
}
static PyMethodDef foo1_methods[] = {
{ "add", (PyCFunction)foo1_add, METH_VARARGS, NULL },
{ NULL, NULL, 0, NULL }
};
PyMODINIT_FUNC initfoo1()
{
Py_InitModule3("foo1", foo1_methods, "My first extension module.");
}
int sum(int x, int y) {
int z;
z = x + y;
z = geturl (x, y);
return (z);
}
/* this is how the CURLOPT_XFERINFOFUNCTION callback works */
#if 0
static int xferinfo(void *p,
curl_off_t dltotal, curl_off_t dlnow,
curl_off_t ultotal, curl_off_t ulnow)
{
struct myprogress *myp = (struct myprogress *)p;
CURL *curl = myp->curl;
double curtime = 0;
curl_easy_getinfo(curl, CURLINFO_TOTAL_TIME, &curtime);
/* under certain circumstances it may be desirable for certain functionality
to only run every N seconds, in order to do this the transaction time can
be used */
if((curtime - myp->lastruntime) >= MINIMAL_PROGRESS_FUNCTIONALITY_INTERVAL) {
myp->lastruntime = curtime;
fprintf(stderr, "TOTAL TIME: %f \r\n", curtime);
}
if (dlnow > 0) {
fprintf(stderr, "UP: %" CURL_FORMAT_CURL_OFF_T " of %" CURL_FORMAT_CURL_OFF_T
" DOWN: %" CURL_FORMAT_CURL_OFF_T " of %" CURL_FORMAT_CURL_OFF_T
"\r\n",
ulnow, ultotal, dlnow, dltotal);
}
myp->totdnld = myp->totdnld + dlnow;
if (dlnow > 0) {
fprintf(stderr, "TOTAL Download: %" CURL_FORMAT_CURL_OFF_T " url is: %s \r\n", myp->totdnld, myp->url);
}
// if(dlnow > STOP_DOWNLOAD_AFTER_THIS_MANY_BYTES)
// return 1;
return 0;
}
#endif
/* for libcurl older than 7.32.0 (CURLOPT_PROGRESSFUNCTION) */
static int older_progress(void *p,
double dltotal, double dlnow,
double ultotal, double ulnow)
{
return xferinfo(p,
(curl_off_t)dltotal,
(curl_off_t)dlnow,
(curl_off_t)ultotal,
(curl_off_t)ulnow);
}
static void *pull_one_url(void *url)
{
CURL *curl;
CURLcode res = CURLE_OK;
struct myprogress prog;
curl = curl_easy_init();
if(curl) {
prog.lastruntime = 0;
prog.curl = curl;
prog.url = url;
prog.totdnld = (curl_off_t) 0;
curl_easy_setopt(curl, CURLOPT_URL, url);
curl_easy_setopt(curl, CURLOPT_PROGRESSFUNCTION, older_progress);
/* pass the struct pointer into the progress function */
curl_easy_setopt(curl, CURLOPT_PROGRESSDATA, &prog);
#if 0
#if LIBCURL_VERSION_NUM >= 0x072000
/* xferinfo was introduced in 7.32.0, no earlier libcurl versions will
compile as they won't have the symbols around.
If built with a newer libcurl, but running with an older libcurl:
curl_easy_setopt() will fail in run-time trying to set the new
callback, making the older callback get used.
New libcurls will prefer the new callback and instead use that one even
if both callbacks are set. */
curl_easy_setopt(curl, CURLOPT_XFERINFOFUNCTION, xferinfo);
/* pass the struct pointer into the xferinfo function, note that this is
an alias to CURLOPT_PROGRESSDATA */
curl_easy_setopt(curl, CURLOPT_XFERINFODATA, &prog);
#endif
#endif
curl_easy_setopt(curl, CURLOPT_NOPROGRESS, 0L);
res = curl_easy_perform(curl);
if(res != CURLE_OK)
fprintf(stderr, "%s\n", curl_easy_strerror(res));
curl_easy_cleanup(curl);
}
return NULL;
}
/*
int pthread_create(pthread_t *new_thread_ID,
const pthread_attr_t *attr,
void * (*start_func)(void *), void *arg);
*/
int geturl(int x, int y)
{
pthread_t tid[NUMT];
int i;
int error;
/* Must initialize libcurl before any threads are started */
curl_global_init(CURL_GLOBAL_ALL);
for(i=0; i< NUMT; i++) {
error = pthread_create(&tid[i],
NULL, /* default attributes please */
pull_one_url,
(void *)urls[i]);
if(0 != error)
fprintf(stderr, "Couldn't run thread number %d, errno %d\n", i, error);
else
fprintf(stderr, "Thread %d, gets %s\n", i, urls[i]);
}
/* now wait for all threads to terminate */
for(i=0; i< NUMT; i++) {
error = pthread_join(tid[i], NULL);
fprintf(stderr, "Thread %d terminated\n", i);
}
return (x * y);
}
Command used for compilation:
gcc -lcurl -lpthread -shared -I/usr/include/python2.7 -fPIC sample.c -o add.so
Error:
>>> import foo1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: ./foo1.so: undefined symbol: curl_easy_perform
>>>
Try moving -lcurl and -lpthread to after sample.c in your compilation command. The linker resolves symbols in left-to-right order, so references from sample.c (e.g., curl_easy_getinfo) will be resolved from libraries specified after it.
It's better to use -pthread than -lpthread, by the way: it also sets the preprocessor flags needed to make some functions reentrant, for example.
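For example, keeping the rest of the question's command unchanged, it would become something like:
gcc -shared -I/usr/include/python2.7 -fPIC sample.c -o add.so -lcurl -pthread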
One of the tools I'm using saves the HTTP data into log files per connection. I was wondering if there is some kind of script to inflate gzip compressed messages in the file.
The data looks like this:
GET /something HTTP/1.1
Content-Type: text/plain
User-Agent: Mozilla/5.0
Connection: Keep-Alive
Accept-Encoding: gzip, deflate
Accept-Language: en-US,*
Host: something.somedomain
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Type: text/xml;charset=UTF-8
Date: Wed, 28 May 2014 20:33:14 GMT
Server: something
Content-Length: 160
Connection: keep-alive
<GZIP SECTION ...
FOLLOWING MORE REQUESTS/RESPONSES
I thought I could do it by hand, but that would take too much time. Then I thought I could write a script, but since I'm not quite an expert with bash/python/perl/whatever, I was hoping somebody had already created such a script.
Thanks for any tips.
Well, I helped myself out and slapped together a C++ app to do what I want. Perhaps somebody might find it useful some day. It also handles chunked encoding. This is how I use it: ls | grep ".log$" | ungzip. The log files were from SSLSplit.
// ungzip.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
void inflate(std::istream& dataIn, std::ostream& dataOut)
{
boost::iostreams::filtering_streambuf<boost::iostreams::input> in;
in.push(boost::iostreams::gzip_decompressor());
in.push(dataIn);
boost::iostreams::copy(in, dataOut);
}
struct membuf : std::streambuf
{
membuf(char* begin, char* end) {
this->setg(begin, begin, end);
}
};
int _tmain(int argc, _TCHAR* argv[])
{
boost::iostreams::mapped_file fileIn;
std::ofstream fileOut;
// For each filename on stdin
for (std::string fileName; std::getline(std::cin, fileName);)
{
// Try opening memory mapping of that file.
try
{
fileIn.open(fileName);
if (fileIn.is_open() == false)
{
std::cout << "Error 1" << std::endl;
continue;
}
}
catch (const std::exception& e)
{
std::cout << e.what();
continue;
}
// Open file to write inflated output to
std::string strOut = fileName;
strOut += ".ugz";
fileOut.open(strOut, std::ios::binary);
if (fileOut.is_open() == false)
{
std::cout << "Error 2" << std::endl;
fileIn.close();
continue;
}
// Load the whole file into a string to verify that it at least has HTTP/1.1 somewhere in it.
// Doesn't mean it's not binary, but better than nothing.
char * pchData = fileIn.data();
std::string strWhole(pchData, pchData + fileIn.size());
std::regex reg("HTTP/1.1 ");
std::smatch match;
std::stringstream ss(strWhole);
// Interesting header information
enum {REGXCNT = 3};
std::regex regs[REGXCNT] = { std::regex("Content-Length: (\\d+)"), std::regex("Content-Encoding: gzip"), std::regex("Transfer-Encoding: chunked") };
// Verify
if (std::regex_search(strWhole, match, reg))
{
int len = 0;
bool bGzipped = false;
bool bChunked = false;
// While there is something to read
while (!ss.eof())
{
std::string strLine;
std::getline(ss, strLine);
// Empty line between Header and Body
if (strLine == "\r")
{
// Print out the empty line \r\n
fileOut << strLine << std::endl;
// If its gzipped or chunked treat it differently
if (bGzipped || bChunked)
{
// GZipped but not chunked
if (bGzipped && !bChunked)
{
// Construct helper structures inflate and write out
char * pbyBinaryData = new char[len];
ss.read(pbyBinaryData, len);
std::stringbuf stringBuf;
membuf gzipdata(pbyBinaryData, pbyBinaryData + len);
std::istream _in(&gzipdata);
std::ostream _out(&stringBuf);
inflate(_in, _out);
std::stringstream ssOut;
ssOut << _out.rdbuf();
std::string strDataOut = ssOut.str();
fileOut.write(strDataOut.c_str(), strDataOut.length());
delete [] pbyBinaryData;
}
// Chunked data goes here
else if (bChunked)
{
// This vector is used for gzipped data
std::vector<char> unchunkedData;
// Load all chunks
while (true)
{
std::getline(ss, strLine);
// Strip \r from it. It should always be at the end, but whatever - performance is not the issue
strLine.erase(std::remove(strLine.begin(), strLine.end(), '\r'), strLine.end());
// Load chunksize
int nChunkSize = std::stoi(strLine, 0, 16);
if (nChunkSize != 0)
{
// Each chunk is ended \r\n -> +2
char * tmpBuf = new char[nChunkSize + 2];
// Read actual data
ss.read(tmpBuf, nChunkSize + 2);
if (!bGzipped)
{
//Data not gzipped. Write them out directly
fileOut.write(tmpBuf, nChunkSize);
}
else
{
//Data gzipped. Add them to vector to decompress later
unchunkedData.insert(unchunkedData.end(), tmpBuf, tmpBuf + nChunkSize);
}
delete[] tmpBuf;
}
else
{
// All chunks loaded. Break the while loop.
break;
}
}
// Data was gzipped. Time to decompress
if (bGzipped)
{
std::stringbuf stringBuf;
membuf gzipdata(unchunkedData.data(), unchunkedData.data()+unchunkedData.size());
std::istream _in(&gzipdata);
std::ostream _out(&stringBuf);
inflate(_in, _out);
std::stringstream ssOut;
ssOut << _out.rdbuf();
std::string strDataOut = ssOut.str();
fileOut.write(strDataOut.c_str(), strDataOut.length());
}
}
}
// Reset flags
bChunked = false;
len = 0;
bGzipped = false;
}
// Otherwise just save it and try to find a key header info in it
else
{
fileOut << strLine << std::endl;
for (int i = 0; i < REGXCNT; ++i)
{
if (std::regex_search(strLine, match, regs[i]))
{
switch (i)
{
case 0:
len = std::stoi(match[1]);
break;
case 1:
bGzipped = true;
break;
case 2:
bChunked = true;
break;
}
break;
}
}
}
}
}
fileOut.flush();
fileIn.close();
fileOut.close();
}
return 0;
}
Header stdafx.h:
#pragma once
#pragma warning (disable: 4244)
#include <tchar.h>
#include <iostream>
#include <boost/tokenizer.hpp>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/stream.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <fstream>
#include <regex>
#include <vector>
#include <streambuf>
#include <sstream>