This project extends the STAR recommendation system by incorporating GPT-2 into the sorting phase. The original retrieval pipeline from the STAR paper is enhanced by generating rankings using not only semantic and collaborative filtering but also GPT-2's natural language understanding. This improvement allows for better personalization of recommendations based on user interaction history.
This repository builds upon the STAR retrieval pipeline and introduces an enhanced sorting mechanism using GPT-2. The recommendation system now ranks candidate items using a combination of semantic similarity, collaborative filtering, and GPT-2 generated scores. The changes are in star_retrieval.py, which scores candidates with GPT-2, and in main.py, which manages the pipeline flow and evaluation. Otherwise the code replicates the STAR framework from the paper; the only modification is the use of GPT-2 in the sorting phase to improve ranking.
- Retrieval Stage: Based on the STAR framework, this stage combines semantic embeddings and collaborative filtering for the initial ranking of candidate items.
- Ranking Stage: GPT-2 refines the ranking of the recalled candidates, using its natural language understanding to adjust item priority. Ranking therefore happens in two steps: candidates are first recalled using semantic and collaborative filtering, and their order is then refined using GPT-2.
- Overall: the framework reproduces the core ideas of the STAR paper and adds GPT-2 as an extra scoring layer in the sorting phase, with the aim of improving recommendation quality.
To use GPT-2 for ranking optimization, download the necessary files from the Hugging Face GPT-2 Model Page. The required files are:
- config.json
- merges.txt
- pytorch_model.bin
- vocab.json
Save these files in a folder (e.g., D:/STAR-main/GPT-2/GPT-2).
In star_retrieval.py, update the GPT-2 model path:

    gpt2_model_path = "D:/STAR-main/GPT-2/GPT-2"  # Set the GPT-2 model path here
    retrieval = STARRetrieval(
        semantic_weight=0.5,
        temporal_decay=0.7,
        history_length=3,
        gpt2_model_path=gpt2_model_path,  # Pass GPT-2 model path
    )

Ensure the path points to the directory containing the four files you downloaded.
In main.py, make sure the GPT-2 model path is set as follows:

    gpt2_model_path = "D:/STAR-main/GPT-2/GPT-2"  # Set the GPT-2 model path here
    retrieval = STARRetrieval(
        semantic_weight=0.5,
        temporal_decay=0.7,
        history_length=3,
        gpt2_model_path=gpt2_model_path,  # Pass GPT-2 model path
    )

Now, when you run the project, GPT-2 will be used to provide additional scoring based on the user's history, optimizing the final ranking.
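Before running the pipeline, it can help to confirm that the model directory really contains the four files listed above. The following is a minimal sketch using only the standard library; the helper name `missing_gpt2_files` is hypothetical and not part of the repository.

```python
from pathlib import Path

# The four files this README lists for the local GPT-2 checkpoint.
REQUIRED_FILES = ("config.json", "merges.txt", "pytorch_model.bin", "vocab.json")


def missing_gpt2_files(model_dir: str) -> list[str]:
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Same example path as in the snippets above; adjust it to your setup.
    missing = missing_gpt2_files("D:/STAR-main/GPT-2/GPT-2")
    if missing:
        print("Missing GPT-2 files:", ", ".join(missing))
    else:
        print("GPT-2 model directory looks complete.")
```

An empty result means all four files were found; it does not verify that the files themselves are valid.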
- Install conda on your PC/Lab PC, and use conda to create a Python 3.11 environment:

      conda create -n envname python=3.11

- Install Python dependencies via Poetry:

      poetry install

- Run the main pipeline:

      poetry run python src/main.py

Final Results:
Results for Beauty dataset:
------------------------------
Metric Score
------------------------------
hit@10 0.3744
hit@5 0.3493
ndcg@10 0.2086
ndcg@5 0.2005
------------------------------
Appendix:
This repository implements the retrieval pipeline from the paper STAR: A Simple Training-free Approach for Recommendations using Large Language Models. It aims to help understand how a training-free recommendation system can be built using:
- LLM embeddings for semantic similarity
- User interaction patterns for collaborative signals
- Temporal decay for recent history weighting
The embeddings are the foundation of semantic similarity:

    from typing import Dict

    from vertexai.language_models import TextEmbeddingInput

    class ItemEmbeddingGenerator:
        def create_embedding_input(self, item_data: Dict) -> TextEmbeddingInput:
            # Creates rich text prompts including:
            # - Full item description
            # - Title
            # - Category hierarchy
            # - Brand (if not ASIN-like)
            # - Price and sales rank
            ...  # method body omitted in this excerpt

Key implementation details:
- Uses Vertex AI's text-embedding-005 model (768 dimensions). Note: in this repository, ItemEmbeddingGenerator has been rewritten to use the BGE model.
- Excludes IDs/URLs to avoid trivial matching
- Preserves complete metadata structure
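To make the prompt construction described above concrete, here is a minimal sketch of how such a text could be assembled. It is not the repository's implementation: the function name `build_embedding_text`, the output layout, and the ASIN-detection regex are all assumptions, and the metadata keys follow the field names used in the SNAP Amazon metadata files.

```python
import re
from typing import Any, Dict

# Assumption: a brand that looks like an ASIN ("B0" + 8 characters, or a
# 10-character ISBN-style code) carries no semantic meaning and is skipped.
ASIN_LIKE = re.compile(r"(B0[A-Z0-9]{8}|\d{9}[\dX])")


def build_embedding_text(item: Dict[str, Any]) -> str:
    """Join selected metadata fields into one prompt, in a fixed order."""
    parts = []
    if item.get("description"):
        parts.append(f"description: {item['description']}")
    if item.get("title"):
        parts.append(f"title: {item['title']}")
    categories = item.get("categories") or []
    if categories:
        # Each category path is a list such as ["Beauty", "Skin Care"].
        paths = [" > ".join(path) for path in categories]
        parts.append("category: " + "; ".join(paths))
    brand = (item.get("brand") or "").strip()
    if brand and not ASIN_LIKE.fullmatch(brand):
        parts.append(f"brand: {brand}")
    if item.get("price") is not None:
        parts.append(f"price: {item['price']}")
    if item.get("salesRank"):
        ranks = ", ".join(f"{k}: {v}" for k, v in item["salesRank"].items())
        parts.append(f"salesRank: {ranks}")
    # Fields such as "asin" and "imUrl" are never read, so IDs and URLs stay out.
    return "\n".join(parts)
```

Emitting the fields in a fixed order keeps prompts comparable across items, which is the point of the "consistent field ordering" advice later in this README.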
The core scoring logic combines three components:
- Semantic Matrix (R_s):

      # Compute cosine similarities between normalized embeddings
      semantic_matrix = 1 - cdist(embeddings_array, embeddings_array, metric='cosine')
      np.fill_diagonal(semantic_matrix, 0)  # Zero out self-similarities

- Collaborative Matrix (R_c) (collaborative_relationships.py):

      # Normalize by user activity sqrt
      user_activity = np.sum(interaction_matrix, axis=0)
      normalized = interaction_matrix / np.sqrt(user_activity)
      collaborative_matrix = normalized @ normalized.T

- Scoring Formula:

      score = 0.0
      for t, (hist_item, rating) in enumerate(zip(reversed(user_history), reversed(ratings))):
          sem_sim = semantic_matrix[cand_idx, hist_idx]
          collab_sim = collaborative_matrix[cand_idx, hist_idx]
          combined_sim = (semantic_weight * sem_sim + (1 - semantic_weight) * collab_sim)
          score += (1/n) * rating * (temporal_decay ** t) * combined_sim

I have integrated GPT-2 for additional scoring of candidate items based on the user's history:
- GPT-2 Integration: I create a prompt based on the user's history and ask GPT-2 to provide a score for the candidate item.

      prompt = f"Given the items the user has liked: {history_str}. How would you rate this item: {candidate_item}?"
      gpt2_score = self.get_gpt2_score(prompt)

- This score is added with a small weight to the final recommendation score.
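The scoring formula and the GPT-2 adjustment can be put together in a small self-contained sketch. It uses plain nested lists instead of NumPy arrays to stay dependency-free, and it assumes the history is given as item indices ordered oldest first. The names `star_score` and `blend_with_gpt2` are hypothetical, and the `gpt2_weight` default of 0.1 is a placeholder: this README only says the weight is "small".

```python
from typing import Sequence


def star_score(
    cand_idx: int,
    history_idx: Sequence[int],
    ratings: Sequence[float],
    semantic_matrix: Sequence[Sequence[float]],
    collaborative_matrix: Sequence[Sequence[float]],
    semantic_weight: float = 0.5,
    temporal_decay: float = 0.7,
) -> float:
    """Score one candidate against a user's history (oldest item first)."""
    n = len(history_idx)
    if n == 0:
        return 0.0
    score = 0.0
    # Walk the history from the most recent item (t = 0) to the oldest.
    for t, (hist_idx, rating) in enumerate(
        zip(reversed(history_idx), reversed(ratings))
    ):
        sem_sim = semantic_matrix[cand_idx][hist_idx]
        collab_sim = collaborative_matrix[cand_idx][hist_idx]
        combined = semantic_weight * sem_sim + (1 - semantic_weight) * collab_sim
        score += (1 / n) * rating * (temporal_decay ** t) * combined
    return score


def blend_with_gpt2(base_score: float, gpt2_score: float, gpt2_weight: float = 0.1) -> float:
    """Add the GPT-2 score with a small weight (0.1 is a hypothetical value)."""
    return base_score + gpt2_weight * gpt2_score
```

Because of the decay term, a history item two steps back contributes only 0.7² = 0.49 of what the same similarity would contribute for the most recent item.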
The code strictly maintains temporal order (temporal_utils.py):
- Sorts reviews by timestamp
- Handles duplicate timestamps
- Ensures test items are truly last in sequence
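The ordering rules above can be illustrated with a short sketch. It is not the code in temporal_utils.py: the function names are hypothetical, the record keys follow the SNAP review files (`reviewerID`, `asin`, `unixReviewTime`), and breaking timestamp ties by file position is an assumption about how duplicates could be handled.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_user_sequences(reviews: List[dict]) -> Dict[str, List[str]]:
    """Group reviews by user and order each user's items by timestamp."""
    per_user: Dict[str, List[Tuple[int, int, str]]] = defaultdict(list)
    for position, review in enumerate(reviews):
        # Assumption: the position in the input breaks ties between
        # identical timestamps, so the ordering is deterministic.
        per_user[review["reviewerID"]].append(
            (review["unixReviewTime"], position, review["asin"])
        )
    return {
        user: [asin for _, _, asin in sorted(events)]
        for user, events in per_user.items()
    }


def leave_last_out(sequence: List[str]) -> Tuple[List[str], str]:
    """Split a chronologically ordered sequence into (history, test item)."""
    if len(sequence) < 2:
        raise ValueError("need at least two interactions to hold one out")
    return sequence[:-1], sequence[-1]
```

Sorting before splitting is what guarantees that the held-out item is the user's last interaction rather than whichever review happened to appear last in the file.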
The evaluation (evaluation_metrics.py) matches the paper's setup:
- Leave-last-out evaluation
- Use the full dataset for evaluation
- Metrics: Hits@5/10, NDCG@5/10
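For leave-last-out evaluation with one held-out item per user, both metrics reduce to simple functions of the target's rank. The sketch below shows the standard single-target definitions; the function name `rank_metrics` is hypothetical and the code is not taken from evaluation_metrics.py.

```python
import math
from typing import Dict, Sequence


def rank_metrics(
    ranked_items: Sequence[str], target: str, k_values: Sequence[int] = (5, 10)
) -> Dict[str, float]:
    """Hit@k and NDCG@k for one user with a single held-out target item."""
    # 1-based rank of the target, or None if it was not ranked at all.
    rank = ranked_items.index(target) + 1 if target in ranked_items else None
    metrics = {}
    for k in k_values:
        hit = rank is not None and rank <= k
        metrics[f"hit@{k}"] = 1.0 if hit else 0.0
        # With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1).
        metrics[f"ndcg@{k}"] = 1.0 / math.log2(rank + 1) if hit else 0.0
    return metrics
```

The reported scores would then be the averages of these per-user values over all evaluated users.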
- Download the Stanford SNAP 5-core Amazon datasets using download_data.py. For example:

      poetry run python download_data.py --category beauty

  This downloads reviews_Beauty_5.json.gz and meta_Beauty.json.gz into the data/ folder.

- Check data with check_data.py:

      poetry run python check_data.py

  This prints the first few lines and verifies the JSON parse.

  Note: These files named reviews_Beauty_5.json.gz etc. are already 5-core datasets. The code still enforces ≥5 interactions, but typically no users/items are removed since the data is already filtered.

- Install Python dependencies via Poetry:

      poetry install

- Run the main pipeline:

      poetry run python src/main.py

  This:
  - Loads reviews and metadata,
  - Sorts each user’s reviews by timestamp (fixing potential out-of-order entries),
  - Creates or loads item embeddings,
  - Computes the semantic and collaborative matrices,
  - Runs GPT-2 scoring on candidates,
  - Splits data into train/val/test in a leave-last-out manner,
  - Runs evaluation with 99 negative samples for each user’s test item,
  - Prints final Hits@K, NDCG@K metrics.
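The negative-sampling step in the list above can be sketched as follows. This is an illustration under stated assumptions, not the repository's sampler: the function name `sample_negatives` and the fixed seed are hypothetical, and the only rule enforced is that negatives are items the user has never interacted with.

```python
import random
from typing import List, Sequence, Set


def sample_negatives(
    all_items: Sequence[str],
    user_items: Set[str],
    n_negatives: int = 99,
    seed: int = 0,
) -> List[str]:
    """Draw items the user never interacted with, without replacement."""
    pool = [item for item in all_items if item not in user_items]
    if len(pool) < n_negatives:
        raise ValueError("not enough unseen items to sample from")
    # A seeded generator keeps the evaluation reproducible across runs.
    return random.Random(seed).sample(pool, n_negatives)
```

Here `user_items` should contain every item in the user's sequence, including the held-out test item, so that the target is ranked against 99 items the user has not seen.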
- Data Quality Matters
  - Use DataQualityChecker to verify metadata richness
  - Check for duplicate timestamps
  - Verify chronological ordering
- Embedding Generation
  - Include all relevant metadata for rich embeddings
  - Avoid ID/URL information that could leak
  - Use consistent field ordering in prompts
- Matrix Computation
  - Normalize embeddings before similarity
  - Proper user activity normalization for collaborative
  - Zero out diagonal elements
- Common Issues
  - Future item leakage in negative sampling
  - Timestamp ordering issues
  - Inadequate metadata in prompts

    # Retrieval parameters (star_retrieval.py)
    semantic_weight = 0.5  # Weight between semantic/collaborative
    temporal_decay = 0.7   # Decay factor for older items
    history_length = 3     # Number of recent items to use
    # Evaluation parameters (evaluation_metrics.py)
    k_values = [5, 10]     # Top-k for metrics

The code provides detailed statistics:
Semantic Matrix Statistics:
- mean_sim: Average semantic similarity
- sparsity: Fraction of zero elements
- min/max_sim: Similarity range
Collaborative Matrix Statistics:
- mean_nonzero: Average co-occurrence strength
- sparsity: Interaction density
These help diagnose if the embeddings or collaborative signals are working as expected.
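As a rough illustration of what these statistics measure, the sketch below computes them for a small matrix. It is an assumption-laden example rather than the repository's code: the function name `matrix_stats` is hypothetical, sparsity is taken to mean the fraction of zero entries, and plain nested lists stand in for the NumPy arrays used elsewhere.

```python
from statistics import fmean
from typing import Dict, Sequence


def matrix_stats(matrix: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Summarize a similarity matrix with the statistics named above."""
    values = [v for row in matrix for v in row]
    nonzero = [v for v in values if v != 0]
    return {
        "mean_sim": fmean(values),
        "mean_nonzero": fmean(nonzero) if nonzero else 0.0,
        "sparsity": 1 - len(nonzero) / len(values),  # fraction of zero entries
        "min_sim": min(values),
        "max_sim": max(values),
    }
```

A semantic matrix with sparsity close to 1, or a collaborative matrix whose mean_nonzero is near zero, would suggest that the corresponding signal is contributing little to the combined score.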
See beauty_results.md for the results on the Beauty dataset.
See Application Data Specification for how to prepare your own data.
@article{lee2024star,
title={STAR: A Simple Training-free Approach for Recommendations using Large Language Models},
author={Lee, Dong-Ho and Kraft, Adam and Jin, Long and Mehta, Nikhil and Xu, Taibai and Hong, Lichan and Chi, Ed H. and Yi, Xinyang},
journal={arXiv preprint arXiv:2410.16458},
year={2024}
}