Update README.md
README.md
CHANGED
|
@@ -13,4 +13,169 @@ metrics:
|
|
| 13 |
- precision
|
| 14 |
- recall
|
| 15 |
library_name: keras
|
| 16 |
-
---
|
| 13 |
- precision
|
| 14 |
- recall
|
| 15 |
library_name: keras
|
| 16 |
+
---
|
| 17 |
+
# Model Card for noobpk/dga-detection
|
| 18 |
+
|
| 19 |
+
## Model Details
|
| 20 |
+
|
| 21 |
+
### Model Description
|
| 22 |
+
|
| 23 |
+
The model is a neural network designed to detect domains produced by domain generation algorithms (DGAs). Malware often uses DGAs to generate pseudo-random domain names for communication. Given a domain name, the model classifies it as DGA-generated or not.
|
| 24 |
+
|
| 25 |
+
Key Features:
|
| 26 |
+
|
| 27 |
+
1. **Input Layer:** Accepts sequences with a maximum length of 45 characters.
|
| 28 |
+
|
| 29 |
+
2. **Embedding Layer:** Converts input characters into dense vector representations of size 100.
|
| 30 |
+
|
| 31 |
+
3. **Conv1D Layer:** Applies 256 filters with a kernel size of 4 to extract features, followed by ReLU activation for non-linearity.
|
| 32 |
+
|
| 33 |
+
4. **Flatten Layer:** Transforms the multi-dimensional tensor into a 1D array for further processing.
|
| 34 |
+
|
| 35 |
+
5. **Dense Layer 1:** Contains 512 units with ReLU activation to learn high-level patterns.
|
| 36 |
+
|
| 37 |
+
6. **Dense Layer 2:** A final layer with 1 unit and a sigmoid activation for binary classification, predicting whether a domain is generated by a DGA.
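
The layer list above fixes most tensor shapes, so the sizes can be checked with plain arithmetic. The sketch below is not the author's code. It assumes a Conv1D stride of 1 with `valid` padding and a 40-symbol vocabulary; neither is stated in the layer list, and `layer_summary` is a hypothetical helper name.

```python
# Shape and parameter arithmetic for the layer stack listed above.
# Assumptions (not stated in the card): stride 1, 'valid' padding,
# and a 40-symbol vocabulary for the embedding layer.

MAX_LEN = 45        # input sequence length
EMBED_DIM = 100     # embedding vector size
FILTERS = 256       # Conv1D filters
KERNEL = 4          # Conv1D kernel size
DENSE_UNITS = 512   # units in the first dense layer
VOCAB_SIZE = 40     # assumed vocabulary size


def layer_summary(vocab_size=VOCAB_SIZE):
    """Return (layer name, output shape, parameter count) tuples."""
    conv_len = MAX_LEN - KERNEL + 1   # output length for 'valid' padding, stride 1
    flat = conv_len * FILTERS         # features after flattening
    return [
        ("embedding", (MAX_LEN, EMBED_DIM), vocab_size * EMBED_DIM),
        ("conv1d", (conv_len, FILTERS), KERNEL * EMBED_DIM * FILTERS + FILTERS),
        ("flatten", (flat,), 0),
        ("dense_1", (DENSE_UNITS,), flat * DENSE_UNITS + DENSE_UNITS),
        ("dense_2", (1,), DENSE_UNITS + 1),
    ]


summary = layer_summary()
for name, shape, params in summary:
    print(f"{name:<10} {str(shape):<12} {params:>10,}")
print("total parameters:", sum(p for _, _, p in summary))
```

Under these assumptions, almost all parameters sit in the first dense layer, because flattening the convolution output yields 42 × 256 = 10,752 features.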
|
| 38 |
+
|
| 39 |
+
- **Developed by:** [noobpk](https://github.com/noobpk/)
|
| 40 |
+
|
| 41 |
+
## Uses
|
| 42 |
+
|
| 43 |
+
The model is designed to be used directly for identifying domains generated by domain generation algorithms (DGAs), which are often associated with malicious software.
|
| 44 |
+
|
| 45 |
+
### Direct Use
|
| 46 |
+
|
| 47 |
+
- Cybersecurity Tools: Integrating the model into systems to detect and block potentially harmful domains.
|
| 48 |
+
|
| 49 |
+
- Network Traffic Monitoring: Assisting in real-time analysis to identify abnormal patterns.
|
| 50 |
+
|
| 51 |
+
- Educational and Research Purposes: Understanding DGA behavior and improving algorithms for detecting them.
|
| 52 |
+
|
| 53 |
+
### Out-of-Scope Use
|
| 54 |
+
|
| 55 |
+
The model should not be used for:
|
| 56 |
+
|
| 57 |
+
- Critical Systems: Relying on the model as the only safeguard in sensitive environments, such as financial transaction systems.
|
| 58 |
+
|
| 59 |
+
- Malicious Intent: Using the model to target or exploit cybersecurity vulnerabilities.
|
| 60 |
+
|
| 61 |
+
- Non-DGA Detection: The model may perform poorly on tasks unrelated to DGA detection, such as phishing detection or validating legitimate domains.
|
| 62 |
+
|
| 63 |
+
## Bias, Risks, and Limitations
|
| 64 |
+
|
| 65 |
+
- False Positives/Negatives: The model may misclassify legitimate domains or fail to identify certain DGA domains, leading to potential disruptions or security risks.
|
| 66 |
+
|
| 67 |
+
- Bias in Training Data: If the training data is not representative of all DGA and non-DGA domains, the model's effectiveness may vary across different networks and datasets.
|
| 68 |
+
|
| 69 |
+
- Dependency on Sequence Length: The model accepts input sequences of at most 45 characters; longer domains are truncated, which may affect its performance.
|
| 70 |
+
|
| 71 |
+
- Evolving Threats: As DGAs develop more sophisticated techniques, the model may require frequent retraining to adapt to new patterns.
|
| 72 |
+
|
| 73 |
+
### Recommendations
|
| 74 |
+
|
| 75 |
+
To minimize risks:
|
| 76 |
+
|
| 77 |
+
- Regularly update the model with new data reflecting evolving DGA techniques.
|
| 78 |
+
|
| 79 |
+
- Employ this model alongside other cybersecurity measures to enhance its effectiveness.
|
| 80 |
+
|
| 81 |
+
- Validate the model's output in diverse network environments to ensure reliability.
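
One way to follow the recommendation of pairing the model with other measures is to put an allowlist and a review band around its score. The sketch below is only an illustration: the function name `decide`, the allowlist, both thresholds, and the example scores are hypothetical and are not part of the model or its repository.

```python
def decide(domain, dga_score, allowlist=frozenset(), block_at=0.9, review_at=0.5):
    """Combine an allowlist with a model score in [0, 1].

    All thresholds are illustrative assumptions, not tuned values.
    """
    name = domain.strip().lower().rstrip(".")
    if name in allowlist:
        return "allow"      # known-good domains bypass the model
    if dga_score >= block_at:
        return "block"      # high-confidence DGA prediction
    if dga_score >= review_at:
        return "review"     # uncertain: defer to another control or an analyst
    return "allow"


ALLOWLIST = frozenset({"example.com"})
# The scores below are made-up inputs, not outputs of the model.
for domain, score in [("example.com", 0.97), ("xkqzjwpvhd.net", 0.97), ("maybe-odd.org", 0.62)]:
    print(domain, decide(domain, score, ALLOWLIST))
```

Keeping a separate "review" outcome limits the impact of false positives, which the limitations above name as a risk.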
|
| 82 |
+
|
| 83 |
+
## How to Get Started with the Model
|
| 84 |
+
|
| 85 |
+
Use the code below to get started with the model.
|
| 86 |
+
|
| 87 |
+
```
|
| 88 |
+
import os
|
| 89 |
+
os.environ["KERAS_BACKEND"] = "tensorflow"
|
| 90 |
+
|
| 91 |
+
from huggingface_hub import hf_hub_download
|
| 92 |
+
from tensorflow.keras.models import load_model
|
| 93 |
+
from keras.preprocessing.sequence import pad_sequences
|
| 94 |
+
|
| 95 |
+
def load_modeler():
|
| 96 |
+
    local_model_path = hf_hub_download(
|
| 97 |
+
        repo_id="noobpk/dga-detection",
|
| 98 |
+
        filename="model.h5"
|
| 99 |
+
    )
|
| 100 |
+
    return load_model(local_model_path)
|
| 101 |
+
|
| 102 |
+
model = load_modeler()
|
| 103 |
+
|
| 104 |
+
valid_characters = "$abcdefghijklmnopqrstuvwxyz0123456789-_."
|
| 105 |
+
tokens = {char: idx for idx, char in enumerate(valid_characters)}
|
| 106 |
+
|
| 107 |
+
if __name__ == "__main__":
|
| 108 |
+
    payload = input("Enter payload: ")
|
| 109 |
+
    print("Processing payload...")
|
| 110 |
+
|
| 111 |
+
    # Convert domain to lowercase and encode it (characters outside the vocabulary are skipped)
|
| 112 |
+
    payload_encoded = [tokens[char] for char in payload.lower() if char in tokens]
|
| 113 |
+
|
| 114 |
+
    # Pad and truncate the sequence
|
| 115 |
+
    domain_encoded = pad_sequences([payload_encoded], maxlen=45, padding='post', truncating='post')
|
| 116 |
+
|
| 117 |
+
    # Make prediction; the sigmoid output is a score for the DGA class, not an accuracy
|
| 118 |
+
    prediction = model.predict(domain_encoded)
|
| 119 |
+
    dga_probability = float(prediction[0][0] * 100)
|
| 120 |
+
    print(f"DGA probability: {dga_probability:.2f}%")
|
| 121 |
+
```
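
The preprocessing in the snippet above can be reproduced without Keras, which makes it easier to inspect what the model actually receives. This is a minimal sketch, assuming `pad_sequences` uses its default pad value of 0; `encode_domain` is a hypothetical helper name.

```python
# Pure-Python sketch of the encoding and padding used in the snippet above.
VALID_CHARACTERS = "$abcdefghijklmnopqrstuvwxyz0123456789-_."
TOKENS = {char: idx for idx, char in enumerate(VALID_CHARACTERS)}
MAX_LEN = 45


def encode_domain(domain, max_len=MAX_LEN):
    """Lowercase, map characters to indices, then truncate and pad at the end."""
    # Characters outside the vocabulary are dropped, as in the original snippet.
    ids = [TOKENS[ch] for ch in domain.lower() if ch in TOKENS]
    ids = ids[:max_len]                        # mirrors truncating='post'
    return ids + [0] * (max_len - len(ids))    # mirrors padding='post' with value 0


encoded = encode_domain("Example.com")
print(encoded[:11])   # indices of the 11 characters in "example.com"
print(len(encoded))   # always MAX_LEN
```

Note that `$` occupies index 0 in the vocabulary, the same value used for padding, so a literal `$` in the input is indistinguishable from padding.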
|
| 122 |
+
|
| 123 |
+
## Training Details
|
| 124 |
+
|
| 125 |
+
### Training Data
|
| 126 |
+
|
| 127 |
+
Dataset: [dga-detection](https://huggingface.co/datasets/noobpk/dga-detection)
|
| 128 |
+
|
| 129 |
+
- 70% of the dataset is used for training.
|
| 130 |
+
|
| 131 |
+
## Evaluation
|
| 132 |
+
|
| 133 |
+
<!-- This section describes the evaluation protocols and provides the results. -->
|
| 134 |
+
|
| 135 |
+
### Testing Data, Factors & Metrics
|
| 136 |
+
|
| 137 |
+
#### Testing Data
|
| 138 |
+
|
| 139 |
+
Dataset: [dga-detection](https://huggingface.co/datasets/noobpk/dga-detection)
|
| 140 |
+
|
| 141 |
+
- 30% of the dataset is used for testing.
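
The card does not say how the 70/30 partition was produced. The sketch below shows one common approach, a seeded shuffle followed by a cut; the helper name `split_70_30`, the seed, and the toy rows are assumptions for illustration only.

```python
import random


def split_70_30(rows, seed=42):
    """Shuffle a copy of rows and split it 70% / 30% (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = len(shuffled) * 7 // 10           # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Toy rows of (domain, label); not taken from the dataset.
rows = [(f"domain{i}.com", i % 2) for i in range(10)]
train, test = split_70_30(rows)
print(len(train), len(test))
```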
|
| 142 |
+
|
| 143 |
+
#### Metrics
|
| 144 |
+
|
| 145 |
+
- precision
|
| 146 |
+
- f1-score
|
| 147 |
+
- recall
|
| 148 |
+
- accuracy
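
For reference, the four metrics listed above can all be derived from confusion-matrix counts. The sketch below uses made-up counts, not results from this model, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Illustrative counts only.
for name, value in binary_metrics(tp=8, fp=2, fn=2, tn=8).items():
    print(f"{name}: {value:.2f}")
```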
|
| 149 |
+
|
| 150 |
+
### Results
|
| 151 |
+
|
| 152 |
+
```
|
| 153 |
+
29704/29704 [==============================] - 82s 3ms/step - loss: 0.0249 - accuracy: 0.9917
|
| 154 |
+
29704/29704 [==============================] - 54s 2ms/step
|
| 155 |
+
Accuracy: 99.17%
|
| 156 |
+
              precision    recall  f1-score   support
|
| 157 |
+
|
| 158 |
+
           0       0.99      0.99      0.99    478072
|
| 159 |
+
           1       0.99      0.99      0.99    472448
|
| 160 |
+
|
| 161 |
+
    accuracy                           0.99    950520
|
| 162 |
+
   macro avg       0.99      0.99      0.99    950520
|
| 163 |
+
weighted avg       0.99      0.99      0.99    950520
|
| 164 |
+
```
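
The numbers in the log above are internally consistent, which can be checked with a few lines of arithmetic. The batch size of 32 is an assumption (it is the Keras default and is not stated in the card); the other values are copied from the log.

```python
import math

SUPPORT_CLASS_0 = 478_072   # support for class 0 in the report
SUPPORT_CLASS_1 = 472_448   # support for class 1 in the report
STEPS = 29_704              # steps shown in the progress bar
BATCH_SIZE = 32             # assumption: Keras default batch size

total = SUPPORT_CLASS_0 + SUPPORT_CLASS_1
print("test samples:", total)
print("steps match batch size 32:", math.ceil(total / BATCH_SIZE) == STEPS)
print("class 1 share:", round(SUPPORT_CLASS_1 / total, 4))

# If the test set is exactly 30% of the data, the full dataset size would be:
print("implied dataset size:", round(total * 10 / 3))
```

The two classes are close to balanced in the test set, so accuracy is a reasonable summary metric here.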
|
| 165 |
+
|
| 166 |
+
|
| 167 |
+
### Compute Infrastructure
|
| 168 |
+
|
| 169 |
+
- Google Colab Pro
|
| 170 |
+
|
| 171 |
+
#### Software
|
| 172 |
+
|
| 173 |
+
- Jupyter Notebook
|
| 174 |
+
|
| 175 |
+
## Model Card Authors
|
| 176 |
+
|
| 177 |
+
[noobpk](https://github.com/noobpk/)
|
| 178 |
+
|
| 179 |
+
## Model Card Contact
|
| 180 |
+
|
| 181 |
+
[noobpk](https://t.me/noobpk)
|