vsubbian/Cynthetic-Time

Cynthetic-Time: A Synthetic Data Generator for Crafting Multivariate Time-Series Data with Diverse Patterns and Missingness Mechanisms

This document describes each function in the Python script and its role in generating synthetic multivariate time-series data. The process is designed to produce samples that exhibit multiple patterns and cover a wide range of characteristics found in multivariate time-series data. It relies on the following functions:

Generating multivariate time-series data

The generate_multivariate_time_series function builds the data by iterating over the list of patterns and applying each pattern to one variable in the time series. The resulting data is then multiplied by the correlation matrix to introduce the specified correlations among the variables. If heteroscedasticity is enabled, the patterns are scaled by a random factor to add complexity to the generated data.
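The steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the argument names (T, patterns, corr, heteroscedastic, rng), the choice of one scalar factor per variable, and the 0.5 to 1.5 scaling range are all assumptions made for the example.

```python
import numpy as np

def generate_multivariate_time_series(T, patterns, corr, heteroscedastic=False, rng=None):
    """Hypothetical sketch: one pattern per variable, then mix with corr."""
    rng = np.random.default_rng() if rng is None else rng
    columns = []
    for pattern in patterns:
        series = pattern(T)  # one pattern per variable, length T
        if heteroscedastic:
            # Assumed: a single random scale factor per variable
            series = series * rng.uniform(0.5, 1.5)
        columns.append(series)
    data = np.column_stack(columns)  # shape (T, M)
    return data @ corr               # multiply by the M x M correlation matrix

# Two illustrative patterns and a hand-written 2 x 2 correlation matrix
patterns = [
    lambda T: np.sin(np.linspace(0, 2 * np.pi, T)),
    lambda T: np.cos(np.linspace(0, 2 * np.pi, T)),
]
corr = np.array([[1.0, 0.3], [0.3, 1.0]])
ts = generate_multivariate_time_series(50, patterns, corr)
ts_h = generate_multivariate_time_series(
    50, patterns, corr, heteroscedastic=True, rng=np.random.default_rng(0)
)
```

With this layout, each output column is a weighted sum of all patterns, and the weights come from the corresponding column of the correlation matrix.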

Introducing correlations between different time-series variables

The correlation_function introduces correlations between the time-series variables, which makes the generated data more realistic. Real-world time-series variables often correlate because of underlying relationships, trends, or shared influences, so simulating these correlations is essential for evaluating time-series-based machine learning methods. The function takes a single input, M, the number of variables in the multivariate time-series data, and outputs a symmetric M x M correlation matrix representing the pairwise correlations between the M variables. It generates an M x M matrix of random values between 0 and 1, makes the matrix symmetric by adding its transpose and dividing the result by 2, and sets the diagonal elements to 1 to represent the perfect correlation of each variable with itself. This construction ensures that the correlation between variable i and variable j equals the correlation between variable j and variable i. The resulting correlation matrix is used in the generate_multivariate_time_series function to create correlated time-series data.
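A minimal sketch of the construction described above, using NumPy. The optional rng argument is an assumption added here for reproducibility; the function is described as taking only M.

```python
import numpy as np

def correlation_function(M, rng=None):
    """Sketch: symmetric M x M matrix with ones on the diagonal."""
    rng = np.random.default_rng() if rng is None else rng  # rng is an assumed extra argument
    A = rng.random((M, M))    # random values in [0, 1)
    C = (A + A.T) / 2         # add the transpose and halve to symmetrize
    np.fill_diagonal(C, 1.0)  # a variable correlates perfectly with itself
    return C

C = correlation_function(4, rng=np.random.default_rng(0))
```

A matrix built this way is symmetric with a unit diagonal, but it is not guaranteed to be positive semi-definite, so it may not correspond to a valid correlation matrix in the strict statistical sense.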

Making synthetic data with different patterns

We have defined five functions that generate five different patterns within the time series data, facilitating the creation of datasets containing up to five different patterns.

The pattern-generator functions produce a variety of patterns in the time series data by using distinct mathematical functions. These patterns form the foundation for generating multivariate time series, and including them allows multivariate time-series methods to be tested on more intricate data with diverse patterns. Each pattern-generator function and the pattern it produces are described below:

  • pattern1: This function generates a sine wave pattern with a single frequency. It takes the number of time steps T as input and returns a sine wave with T points in the range [0, 2 * pi]. This pattern represents a simple periodic oscillation.
  • pattern2: This function generates a cosine wave pattern with a single frequency. It takes the number of time steps T as input and returns a cosine wave with T points in the range [0, 2 * pi]. This pattern is similar to pattern1 but is phase-shifted by pi/2, representing another simple periodic oscillation.
  • pattern3: This function generates a sine wave pattern with double frequency. It takes the number of time steps T as input and returns a sine wave with T points in the range [0, 4 * pi]. The double frequency results in twice as many oscillations within the same range as pattern1, representing a more complex periodic oscillation.
  • pattern4: This function generates a cosine wave pattern with double frequency. It takes the number of time steps T as input and returns a cosine wave with T points in the range [0, 4 * pi]. Similar to pattern3, this pattern has a higher frequency than pattern2 and represents another complex periodic oscillation with a different phase shift.
  • pattern5: This function generates a product of sine and cosine waves with a single frequency. It takes the number of time steps T as input and returns a waveform that is the product of a sine wave and a cosine wave with T points in the range [0, 2 * pi]. This pattern combines the characteristics of both sine and cosine waves. By the identity sin(x) * cos(x) = sin(2x) / 2, it is equivalent to a sine wave at double frequency with half the amplitude.
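The five descriptions above can be sketched in a few lines of NumPy. This assumes the T points are evenly spaced with both endpoints included (np.linspace), which the descriptions do not state explicitly.

```python
import numpy as np

def pattern1(T):
    # Sine wave, single frequency, over [0, 2*pi]
    return np.sin(np.linspace(0, 2 * np.pi, T))

def pattern2(T):
    # Cosine wave, single frequency, over [0, 2*pi]
    return np.cos(np.linspace(0, 2 * np.pi, T))

def pattern3(T):
    # Sine wave, double frequency, over [0, 4*pi]
    return np.sin(np.linspace(0, 4 * np.pi, T))

def pattern4(T):
    # Cosine wave, double frequency, over [0, 4*pi]
    return np.cos(np.linspace(0, 4 * np.pi, T))

def pattern5(T):
    # Product of sine and cosine over [0, 2*pi]
    x = np.linspace(0, 2 * np.pi, T)
    return np.sin(x) * np.cos(x)

patterns = [pattern1, pattern2, pattern3, pattern4, pattern5]
samples = [p(100) for p in patterns]
```

Under this spacing assumption, pattern5(T) equals 0.5 * pattern3(T) up to floating-point error, which follows from sin(x) * cos(x) = sin(2x) / 2.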

Applying Missingness Mechanisms

In real-world time series data, missing values can occur for various reasons, such as data entry errors, sensor malfunctions, data corruption, incomplete data collection, sampling issues, data aggregation discrepancies, intentional data omission, seasonality, and non-stationarity. Depending on the underlying cause, the missing values can exhibit different statistical properties. There are three main missingness mechanisms in time series data: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Each mechanism is described below with a real-world clinical example.

  • Missing Completely At Random (MCAR): In this mechanism, the probability of a value being missing is independent of both the observed and unobserved data. In other words, the missingness does not depend on any variable or pattern in the dataset. For example, in a real-world clinical scenario, imagine a dataset containing patients' blood pressure readings taken at various time points during their hospital stay. If some readings are missing due to random factors such as data entry errors or misplaced records, the missing data can be classified as MCAR.
  • Missing At Random (MAR): In this mechanism, the probability of a value being missing depends on the observed data but not on the unobserved data. This means that the missingness may be related to some variables in the dataset, but it is not directly related to the missing values themselves. For instance, consider a real-world clinical study in which patients' blood pressure is monitored over time. If patients with a history of smoking are more likely to miss follow-up appointments (resulting in missing blood pressure records), the missing data can be classified as MAR. The missingness is contingent on the observed variable (smoking status), but not on the unrecorded blood pressure values during the missed appointments.
  • Missing Not At Random (MNAR): In this mechanism, the probability of a value being missing is dependent on the unobserved data, specifically the missing values themselves. Handling this type of missingness is particularly challenging, as the missingness is directly connected to the values we aim to estimate. If not addressed correctly, this can result in biased estimates. A real-world clinical example is a study examining the impact of a new drug on patients' pain levels. Suppose patients experiencing higher pain levels are more prone to skipping follow-up appointments, resulting in missing pain level measurements. In this scenario, the missingness relies on the unobserved pain levels at the time of the missed appointment, which could be related to the patients' response to the drug. This makes the missing data MNAR.
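To make the three definitions concrete, here is one possible way to build a boolean missingness mask for each mechanism on a T x M array. It is an illustrative sketch only and is not taken from the script's apply_missingness function: the helper names (mcar_mask, mar_mask, mnar_mask), the driver and target columns, and the "above the median" rule are all assumptions.

```python
import numpy as np

def mcar_mask(ts, p_miss, rng):
    # MCAR: every entry is dropped with the same probability
    return rng.random(ts.shape) < p_miss

def mar_mask(ts, p_miss, rng, driver=0, target=1):
    # MAR: the target column goes missing depending on an observed driver column
    mask = np.zeros(ts.shape, dtype=bool)
    high = ts[:, driver] > np.median(ts[:, driver])
    prob = np.where(high, min(1.0, 2 * p_miss), 0.0)
    mask[:, target] = rng.random(ts.shape[0]) < prob
    return mask

def mnar_mask(ts, p_miss, rng):
    # MNAR: an entry's chance of being dropped depends on its own value
    high = ts > np.median(ts, axis=0)
    prob = np.where(high, min(1.0, 2 * p_miss), 0.0)
    return rng.random(ts.shape) < prob

rng = np.random.default_rng(0)
ts = rng.normal(size=(200, 3))   # stand-in data, not generated by the script
m_mcar = mcar_mask(ts, 0.2, rng)
m_mar = mar_mask(ts, 0.2, rng)
m_mnar = mnar_mask(ts, 0.2, rng)
ts_mnar = np.where(m_mnar, np.nan, ts)  # masked entries become NaN
```

In the MAR sketch the driver column is never masked, so the cause of missingness stays observed. In the MNAR sketch the values that determine missingness are exactly the ones that get removed.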

The apply_missingness function introduces missing values into a given time series dataset (ts) and non-temporal data (non_temporal_sample) according to a specified missing data mechanism (MCAR, MAR, or MNAR) and a given probability of missingness (P_miss). The
