Big Data

Description

No description available for this research axis.

Publications

  • 2024
    Salah Ghodhbani

    A New Multimodal and Spatio-Temporal Dataset for Traffic Control: Development, Analysis, and Potential Applications

2024

Abstract

Multimodal data, which includes various data formats such as image, video, text, and sensor data, is essential for urban traffic management. The lack of proven multimodal transportation data has been a significant challenge for urban planners, leading to biased or incomplete estimates of travel demand, mode choice, and network performance. Multimodal data integration offers a valuable resource for understanding and optimizing traffic control and management. However, the heterogeneity of the data, various kinds of noise, alignment of modalities, and techniques to handle missing data are some of the challenges that arise. This paper presents a novel multimodal dataset, the first of its kind, scraped from Highways England and incorporating speed, flow, and camera images for the M60, M25, and M1 motorways. The dataset provides a comprehensive view of traffic behavior at specific junctions, enabling detailed analysis and real-world applications. By integrating previously disparate data sources, this dataset offers a valuable resource for understanding and optimizing traffic control and management. The paper outlines the dataset's development, including the gathering of speed and flow data and the use of image scraping techniques to capture CCTV images. The potential applications of the dataset for traffic control, planning, and optimization are also discussed. Overall, this multimodal dataset represents a significant contribution to the field, with implications for the development of advanced traffic management systems and the improvement of transportation infrastructure.
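One of the challenges the abstract names is the alignment of modalities. A minimal sketch of one way to approach it, pairing each CCTV frame with the nearest speed/flow reading by timestamp (field names, values, and the tolerance are illustrative, not taken from the dataset):

```python
import bisect

def align_nearest(readings, frames, max_gap=60):
    """Pair each camera frame with the closest sensor reading in time.

    readings: time-sorted list of (timestamp, speed, flow) tuples
    frames:   list of (timestamp, image_id) tuples
    max_gap:  drop pairs whose timestamps differ by more than this (seconds)
    """
    times = [r[0] for r in readings]
    pairs = []
    for ts, image_id in frames:
        i = bisect.bisect_left(times, ts)
        # candidates: the reading just before and just after the frame
        candidates = [j for j in (i - 1, i) if 0 <= j < len(readings)]
        best = min(candidates, key=lambda j: abs(times[j] - ts))
        if abs(times[best] - ts) <= max_gap:
            pairs.append((image_id, readings[best]))
    return pairs

# hypothetical minute-level readings and CCTV captures
readings = [(0, 68.0, 1200), (60, 65.5, 1350), (120, 40.2, 1800)]
frames = [(55, "cam_a"), (130, "cam_b"), (400, "cam_c")]
aligned = align_nearest(readings, frames)  # cam_c has no close-enough reading
```

A real pipeline would also have to handle sensor dropouts and missing frames; the `max_gap` cut-off here simply discards frames with no reading close enough in time.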

    Salah Ghodhbani, Sabeur Elkosantini

A Spatial-Temporal DL Approach for Traffic Flow Prediction Using Attention Fusion Method

2024

Abstract

In recent years, traffic flow prediction has presented challenges in the management of transportation systems. It is a crucial part of Intelligent Transportation Systems (ITS). The complexities of various transportation data, spatial and temporal dependencies on road networks, and multimodalities, such as public transit, pedestrian flow, and bike sharing, make it a challenging task to forecast traffic flow accurately. Numerous works have been introduced to address these challenges, but few have simultaneously considered these factors, resulting in limited success. In this study, a model is proposed to integrate Graph Convolutional Networks (GCN) and Bidirectional Long Short-Term Memory (BiLSTM). This model exploits the advantages of GCN in handling spatial data and capturing dependencies in road networks, combined with BiLSTM's capability in learning temporal dynamics. The proposed model can extract comprehensive features from various transportation data and effectively capture the spatial-temporal dependencies. By merging these features, it aims to generate more accurate and robust traffic flow predictions. This method addresses the limitations of existing methods that fail to consider spatial-temporal dependencies and multimodalities, leading to improved prediction accuracy and efficiency.
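As a rough illustration of the spatial half of such a model, the core GCN operation is a normalized neighborhood aggregation over the road graph. The sketch below shows a single propagation step in pure Python (weights and nonlinearity omitted for brevity; in the paper's setting, the resulting node embeddings would then feed a BiLSTM over time):

```python
import math

def gcn_propagate(adj, features):
    """One GCN propagation step: H' = D^-1/2 (A + I) D^-1/2 H.

    adj:      n x n adjacency matrix (0/1) of the road graph
    features: n x f node feature matrix (e.g. recent flow per sensor)
    """
    n = len(adj)
    # add self-loops so each node keeps part of its own signal
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # symmetric normalization by node degrees
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    f = len(features[0])
    return [[sum(norm[i][k] * features[k][j] for k in range(n)) for j in range(f)]
            for i in range(n)]

# toy road graph: sensor 0 - sensor 1 - sensor 2 (a chain of junctions)
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
flow = [[100.0], [200.0], [300.0]]
smoothed = gcn_propagate(adj, flow)  # each node now mixes its neighbors' flow
```

Stacking such steps (with learned weight matrices) lets each sensor's embedding reflect traffic further along the network, which is what the temporal model then tracks over time.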

  • Marwa Chabbouh, Slim Bechikh, Lamjed Ben Said, Efrén Mezura-Montes

    Imbalanced multi-label data classification as a bi-level optimization problem: application to miRNA-related diseases diagnosis

    Neural Comput. Appl. 35(22): 16285-16303 (2023), 2023

Abstract

In multi-label classification, each instance can be assigned multiple labels at the same time. In such a situation, the relationships between labels and the class imbalance are two serious issues that should be addressed. Despite the large number of existing multi-label classification methods, the widespread class imbalance among labels has not been adequately addressed. Two main issues should be solved to come up with an effective classifier for imbalanced multi-label data. On the one hand, the imbalance could occur between labels and/or within a label. The “Between-labels imbalance” occurs when the imbalance is between labels, whereas the “Within-label imbalance” occurs when the imbalance is within the label itself, and it may occur across multiple labels. On the other hand, the labels’ processing order heavily influences the quality of a multi-label classifier. To deal with these challenges, we propose in this paper a bi-level evolutionary approach for the optimized induction of multivariate decision trees, where the upper level designs the classifiers while the lower level approximates the optimal labels’ ordering for each classifier. Our proposed method, named BIMLC-GA (Bi-level Imbalanced Multi-Label Classification Genetic Algorithm), is compared to several state-of-the-art methods across a variety of imbalanced multi-label data sets from several application fields and then applied to the miRNA-related diseases case study. The statistical analysis of the obtained results shows the merits of our proposal.
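The "within-label imbalance" described above can be made concrete as a per-label imbalance ratio over a binary label matrix. A small illustrative sketch (not the paper's measure):

```python
def imbalance_ratios(label_matrix):
    """Per-label imbalance ratio (majority count / minority count) for a
    binary multi-label matrix (rows = instances, columns = labels)."""
    n_labels = len(label_matrix[0])
    ratios = []
    for j in range(n_labels):
        pos = sum(row[j] for row in label_matrix)
        neg = len(label_matrix) - pos
        # guard against an empty minority class
        ratios.append(max(pos, neg) / max(1, min(pos, neg)))
    return ratios

# label 0 is balanced; label 1 never occurs (extreme within-label imbalance)
Y = [[1, 0], [1, 0], [0, 0], [0, 0]]
ratios = imbalance_ratios(Y)
```

A high ratio on one column signals within-label imbalance; large differences between columns signal between-labels imbalance.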

  • Emna Hosni, Wided Lejouad Chaari, Nader Kolsi, Khaled Ghedira

    Effective Resource Utilization in Heterogeneous Hadoop Environment Through a Dynamic Inter-cluster and Intra-cluster Load Balancing

Asian Conference on Intelligent Information and Database Systems (ACIIDS), Ho Chi Minh City, Vietnam, Part 2, pp. 669-681, 2022

Abstract

Apache Hadoop is one of the most popular distributed computing systems, used largely for big data analysis and processing. The Hadoop cluster hosts multiple parallel workloads requiring various resource usage (CPU, RAM, etc.). In practice, in heterogeneous Hadoop environments, resource-intensive tasks may be allocated to the lower-performing nodes, causing load imbalance between and within clusters and a high data transfer cost. These weaknesses lead to performance deterioration of the Hadoop system and delay the completion of all submitted jobs. To overcome these challenges, this paper proposes an efficient and dynamic load balancing policy in a heterogeneous Hadoop YARN cluster. This novel load balancing model is based on clustering nodes into subgroups of nodes similar in performance, and then allocating different jobs to these subgroups using a multi-criteria ranking. This policy ensures the most accurate match between resource demands and available resources in real time, which decreases data transfer in the cluster. The experimental results show that the introduced approach noticeably reduces completion times, by 42% and 11% compared with H-fair and a load balancing approach, respectively. Thus, Hadoop can rapidly release the resources for the next job, which enhances the overall performance of distributed computing systems. The obtained findings also reveal that our approach optimizes the use of available resources and avoids cluster overload in real time.
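The two building blocks the abstract describes, clustering nodes into performance subgroups and then ranking candidates with a multi-criteria score, can be sketched as follows (criteria names, weights, and threshold are hypothetical, not the paper's actual model):

```python
def cluster_by_performance(capacity, threshold=0.5):
    """Split nodes into high/low performance subgroups by a capacity score."""
    high = [n for n in capacity if capacity[n] >= threshold]
    low = [n for n in capacity if capacity[n] < threshold]
    return high, low

def rank_nodes(metrics, weights):
    """Rank candidate nodes for a job by a weighted multi-criteria score.

    metrics: node -> dict of normalized criteria in [0, 1]
    weights: criterion -> weight, summing to 1
    """
    def score(n):
        return sum(weights[c] * metrics[n][c] for c in weights)
    return sorted(metrics, key=score, reverse=True)

capacity = {"node-a": 0.9, "node-b": 0.3, "node-c": 0.7}
high, low = cluster_by_performance(capacity)

# rank only within the high-performance subgroup, for a CPU-heavy job
metrics = {
    "node-a": {"free_cpu": 0.2, "free_ram": 0.8, "data_locality": 1.0},
    "node-c": {"free_cpu": 0.9, "free_ram": 0.4, "data_locality": 0.0},
}
order = rank_nodes(metrics, {"free_cpu": 0.6, "free_ram": 0.2, "data_locality": 0.2})
```

Weighting `free_cpu` heavily makes the otherwise weaker-scoring `node-c` win here; tuning those weights per job type is what makes the ranking multi-criteria.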

  • Haithem Mezni, Mokhtar Sabeur, Sabeur Aridhi, Faouzi Ben Charrada

    Towards big services: a synergy between service computing and parallel programming

    Computing, 2021

Abstract

Over the last years, cloud computing has emerged as a natural choice to host, manage, and provide various kinds of virtualized resources (e.g., software, business processes, databases, platforms, mobile and social applications, etc.) as on-demand services. This “servicelization” across various domains has produced a huge volume of data, leading to the emergence of a new service model, called big service. The latter consists of the encapsulation, abstraction, and processing of big data, thereby hiding its complexity. However, this promising approach still lacks management facilities and tools. Indeed, due to the highly dynamic and uncertain nature of their hosting cloud environments, big services, together with their accessed data, need continuous management operations so as to maintain a stable state and a high quality of execution. In this context, frameworks for designing, composing, executing and managing big services become a major need. The purpose of this paper is to provide an understanding of the new emerging big service model from the point of view of its lifecycle management phases. We also study the role of big data frameworks and multi-cloud strategies in the provisioning of big services. A research road map on this topic is summarized at the end of the paper.

    Mokhtar Haithem, Haithem Mezni, Mohand Said Hacid, Mohamed Mohsen Gammoudi

    Clustering-based data placement in cloud computing: a predictive approach

    Cluster Computing, 2021

Abstract

Nowadays, cloud computing environments have become a natural choice to host and process a huge volume of data. The combination of cloud computing and big data frameworks is an effective way to run data-intensive applications and tasks. Also, an optimal arrangement of data partitions can improve task execution, which is not the case in most big data frameworks. For example, the default distribution of data partitions in Hadoop-based clouds causes several problems, mainly related to load balancing and resource usage. In addition, most existing data placement solutions are static and lack precision in the placement of data partitions. To overcome these issues, we propose a data placement approach based on the prediction of future resource usage. We exploit Kernel Density Estimation (KDE) and fuzzy FCA techniques to, first, forecast the workers’ and tasks’ future resource consumption and, second, cluster data partitions and intensive jobs according to the estimated resource usage. Fuzzy FCA is also used to exclude partitions and jobs that require fewer resources, which reduces needless migrations. To allow monitoring and predicting the workers’ states and the data partitions’ consumption, we modeled the big data cluster as an autonomic service-based system. The obtained results show that our solution outperforms existing approaches in terms of migration rate and resource consumption.
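The KDE-based prediction step can be pictured with a plain Gaussian kernel density estimate over recent usage samples; the sketch below picks the usage level where the estimated density peaks (bandwidth, grid, and sample values are illustrative, not the paper's configuration):

```python
import math

def gaussian_kde(samples, bandwidth):
    """Build a Gaussian kernel density estimator from observed usage samples."""
    n = len(samples)
    norm = n * bandwidth * math.sqrt(2 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                   for s in samples) / norm
    return density

def most_likely_usage(samples, bandwidth=5.0, grid=range(101)):
    """Predict the usage level (in %) where the estimated density peaks."""
    f = gaussian_kde(samples, bandwidth)
    return max(grid, key=f)

# hypothetical recent CPU readings for one worker; the single spike at 80%
# barely moves the mode, unlike a plain average would be moved
cpu_history = [35, 38, 40, 42, 41, 39, 80]
predicted = most_likely_usage(cpu_history)
```

Robustness to outliers like that spike is one reason to prefer a density mode over a mean when deciding where to place partitions.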

  • Mokhtar Sellami, Haithem Mezni, Mohand Said Hacid

    On the use of big data frameworks for big service composition

    Journal of Network and Computer Applications, 2020

Abstract

Over the last years, big data has emerged as a new paradigm for the processing and analysis of massive volumes of data. Big data processing has been combined with service and cloud computing, leading to a new class of services called “Big Services”. In this new model, services can be seen as an abstract layer that hides the complexity of the processed big data. To meet users' complex and heterogeneous needs in the era of big data, service reuse is a natural and efficient means of orchestrating available services' operations to provide customers with on-demand big services. However, different from traditional Web service composition, composing big services refers to the reuse of not only existing high-quality services, but also high-quality data sources, while taking into account their security constraints (e.g., data provenance, threat level and data leakage). Moreover, composing heterogeneous and large-scale data-centric services faces several challenges apart from security risks, such as big services' high execution time and the incompatibility between providers' policies across multiple domains and clouds. Aiming to solve the above issues, we propose a scalable approach for big service composition, which considers not only the quality of reused services (QoS), but also the quality of their consumed data sources (QoD). Since the correct representation of big service requirements is the first step towards an effective composition, we first propose a quality model for big services and quantify data breaches using L-Severity metrics. Then, to facilitate processing and mining big services' related information during composition, we exploit the strong mathematical foundation of fuzzy Relational Concept Analysis (fuzzy RCA) to build the big services' repository as a lattice family. We also use fuzzy RCA to cluster services and data sources based on various criteria, including their quality levels, their domains, and the relationships between them. Finally, we define algorithms that parse the lattice family to select and compose high-quality and secure big services in a parallel fashion. The proposed method, which is implemented on top of the Spark big data framework, is compared with two existing approaches, and experimental studies proved the effectiveness of our big service composition approach in terms of QoD-aware composition, scalability, and security breaches.
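The QoS/QoD-aware selection idea can be sketched with a toy scoring function; the field names, weights, and threat-level filter below are illustrative stand-ins for the paper's quality model and L-Severity metrics:

```python
def composite_score(service, w_qos=0.5, w_qod=0.5):
    """Blend a candidate's service quality (QoS) with the quality of the
    data sources it consumes (QoD) into a single score."""
    return w_qos * service["qos"] + w_qod * service["qod"]

def compose(tasks, candidates, threat_limit=0.5):
    """Pick, per abstract task, the best-scoring candidate whose
    security threat level is acceptable."""
    plan = {}
    for task in tasks:
        safe = [c for c in candidates[task] if c["threat"] <= threat_limit]
        plan[task] = max(safe, key=composite_score)["name"]
    return plan

candidates = {
    "ingest": [
        {"name": "s1", "qos": 0.9, "qod": 0.4, "threat": 0.2},
        {"name": "s2", "qos": 0.7, "qod": 0.9, "threat": 0.1},
        {"name": "s3", "qos": 0.95, "qod": 0.95, "threat": 0.9},  # excluded: too risky
    ],
}
plan = compose(["ingest"], candidates)
```

Note how the security filter runs before quality scoring: the highest-quality candidate `s3` is never considered, and `s2` wins because its strong data-source quality offsets its weaker QoS.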

    Mohamed Gharbi, Haithem Mezni

    Towards big services composition

    Web and Grid Services, 2020

Abstract

Recently, cloud computing has been combined with big data processing, leading to a new model of services called big services. This model addresses customers' complex requirements by reusing and aggregating existing services from various domains and delivery models, and from multiple cloud availability zones. Existing web/cloud service composition approaches are not adequate for the big service context for many reasons, including the large volume of data, cross-domain and cross-cloud interoperability issues, etc. Considering the aforementioned facts, we provide a solution to the big service composition issue by taking advantage of relational concept analysis (RCA) as a clustering method and composite particle swarm optimisation (CPSO) as an optimisation technique. RCA is used to model the big service environment, whereas CPSO continuously optimises the quality of the big service composition. Implementation and experimental studies have proven the feasibility and efficiency of our approach.

  • Imen Khamassi, Moamar Sayed-Mouchaweh, Moez Hammami, Khaled Ghedira

    A New Combination of Diversity Techniques in Ensemble Classifiers for Handling Complex Concept Drift

Book chapter in Learning from Data Streams in Evolving Environments, pp. 39-61, Springer International Publishing, January 2019

Abstract

Recent advances in Computational Intelligent Systems have focused on addressing complex problems related to the dynamicity of environments. Generally, in dynamic environments, data are presented as streams that may evolve over time; this is known as concept drift. Handling concept drift through ensemble classifiers has received great interest in recent decades. The success of these ensemble methods relies on their diversity. Accordingly, various diversity techniques can be used, such as block-based data, weighting-data, or filtering-data. Each of these diversity techniques is efficient for handling certain characteristics of drift. However, when the drift is complex, they fail to handle it efficiently. Complex drifts may present a mixture of several characteristics (speed, severity, influence zones in the feature space, etc.) which may vary over time. In this case, drift handling is more complicated and requires new detection and updating tools. For this purpose, a new ensemble approach, namely EnsembleEDIST2, is presented. It combines the three diversity techniques in order to benefit from their advantages and overcome their limits. Additionally, it makes use of EDIST2 as its drift detection mechanism, in order to monitor the ensemble’s performance and detect changes. EnsembleEDIST2 was tested through different scenarios of complex drift generated from synthetic and real datasets. This diversity combination allows EnsembleEDIST2 to outperform similar ensemble approaches in terms of accuracy rate, and to present stable behavior in handling different scenarios of complex drift.
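The three diversity techniques can be pictured as members of an ensemble each seeing a different view of the same stream, with outputs combined by vote. The helpers below are simplified illustrations of that idea, not EnsembleEDIST2 itself:

```python
from collections import Counter

def majority_vote(member_predictions):
    """Combine the ensemble members' predictions for one instance."""
    return Counter(member_predictions).most_common(1)[0][0]

# each diversity technique gives a member a different view of the stream:
def block_based(stream, size):
    """Train on the most recent block of instances only."""
    return stream[-size:]

def weighting(stream):
    """Attach growing weights so recent instances matter more."""
    return [(x, y, i + 1) for i, (x, y) in enumerate(stream)]

def filtering(stream, keep):
    """Keep only the instances matching a predicate (e.g. a feature-space zone)."""
    return [(x, y) for x, y in stream if keep(x)]

stream = [(0, "A"), (1, "B"), (2, "A"), (3, "A")]
```

Block-based members react quickly to abrupt drift, weighted members track gradual drift, and filtered members can target local drift in a region of the feature space, which is why combining the three helps with drifts that mix those characteristics.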

  • Wissem Inoubli, Sabeur Aridhi, Haithem Mezni, Mondher Maddouri, Engelbert Mephu Nguifo

    An experimental survey on big data frameworks

    Future Generation Computer Systems, 2018

Abstract

Recently, increasingly large amounts of data are generated from a variety of sources. Existing data processing technologies are not suitable to cope with the huge amounts of generated data. Yet, many research works focus on Big Data, a buzzword referring to the processing of massive volumes of (unstructured) data. Recently proposed frameworks for Big Data applications help to store, analyze and process the data. In this paper, we discuss the challenges of Big Data and we survey existing Big Data frameworks. We also present an experimental evaluation and a comparative study of the most popular Big Data frameworks with several representative batch and iterative workloads. This survey is concluded with a presentation of best practices related to the use of studied frameworks in several application domains such as machine learning, graph processing and real-world applications.

    Wissem Inoubli, Sabeur Aridhi, Haithem Mezni, Mondher Maddouri, Engelbert Mephu Nguifo

    A Comparative Study on Streaming Frameworks for Big Data

    VLDB 2018-44th International Conference on Very Large Data Bases: Workshop LADaS-Latin American Data Science, 2018

Abstract

    Recently, increasingly large amounts of data are generated from a variety of sources. Existing data processing technologies are not suitable to cope with the huge amounts of generated data. Yet, many research works focus on streaming in Big Data, a task referring to the processing of massive volumes of structured/unstructured streaming data. Recently proposed streaming frameworks for Big Data applications help to store, analyze and process the continuously captured data. In this paper, we discuss the challenges of Big Data and we survey existing streaming frameworks for Big Data. We also present an experimental evaluation and a comparative study of the most popular streaming platforms.

  • Imen Khamassi, Moamar Sayed-Mouchaweh, Moez Hammami, Khaled Ghedira

    Self-Adaptive Windowing Approach for Handling Complex Concept Drift

Cognitive Computation, Springer, vol. 7, issue 6, pp. 772-790 (2015); Evolving Systems, Springer-Verlag Berlin Heidelberg, 2016

Abstract

Detecting changes in data streams attracts major attention in cognitive computing systems. The challenging issue is how to monitor and detect these changes in order to preserve the model's performance during complex drifts. By complex drift, we mean a drift that presents many characteristics at the same time. The most challenging complex drifts are gradual continuous drifts, where changes are only noticed over a long time period. Moreover, these gradual drifts may also be local, in the sense that they may affect only a small amount of data, which makes drift detection more complicated. For this purpose, a new drift detection mechanism, EDIST2, is proposed to deal with these complex drifts. EDIST2 monitors the learner's performance through a self-adaptive window that is autonomously adjusted via a statistical hypothesis test. This statistical test provides theoretical guarantees regarding the false alarm rate, which were experimentally confirmed. EDIST2 has been tested on six synthetic datasets presenting different kinds of complex drift, and on five real-world datasets. Encouraging results were found compared to similar approaches: EDIST2 achieved a good accuracy rate on synthetic and real-world datasets, with minimal detection delay and false alarm rate.
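The window comparison behind such detectors can be sketched with a standard two-proportion z-test over error indicators; this is an illustrative stand-in, not the exact statistic or adaptive-window logic EDIST2 uses:

```python
import math

def drift_detected(ref_errors, recent_errors, z_threshold=2.58):
    """Flag drift when the recent window's error rate is significantly
    higher than the reference window's (two-proportion z-test).

    ref_errors / recent_errors: lists of 0/1 prediction-error indicators
    z_threshold: higher values mean fewer false alarms but slower detection
    """
    n1, n2 = len(ref_errors), len(recent_errors)
    p1, p2 = sum(ref_errors) / n1, sum(recent_errors) / n2
    # pooled error rate under the "no drift" hypothesis
    p = (sum(ref_errors) + sum(recent_errors)) / (n1 + n2)
    if p in (0.0, 1.0):
        return False  # degenerate windows: nothing to test
    z = (p2 - p1) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return z > z_threshold

# 10% error in both windows: no drift signal
stable = drift_detected([0] * 90 + [1] * 10, [0] * 45 + [1] * 5)
# error rate jumps from 10% to 60%: drift flagged
drift = drift_detected([0] * 90 + [1] * 10, [0] * 20 + [1] * 30)
```

The `z_threshold` plays the role of the significance level: it trades false alarm rate against detection delay, the same trade-off the abstract's theoretical guarantees address.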