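Model:
zhayunduo/roberta-base-stocktwits-finetuned
This model is roberta-base fine-tuned on 3,200,000 comments from StockTwits, each labeled 'Bullish' or 'Bearish' by the user who posted it.
In the Inference API, try phrases an individual investor might post on an investment forum, for example 'red' and 'green'.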
| Epoch | Train loss | Validation loss | Validation accuracy |
|---|---|---|---|
| epoch1 | 0.3495 | 0.2956 | 0.8679 |
| epoch2 | 0.2717 | 0.2235 | 0.9021 |
| epoch3 | 0.2360 | 0.1875 | 0.9210 |
| epoch4 | 0.2106 | 0.1603 | 0.9343 |
import re
from transformers import RobertaForSequenceClassification, RobertaTokenizer
from transformers import pipeline
import pandas as pd
import emoji
# The model was trained on text preprocessed as follows.
def process_text(texts):
    # remove URLs
    texts = re.sub(r'https?://\S+', "", texts)
    texts = re.sub(r'www.\S+', "", texts)
    # replace the HTML entity &#39; with a plain apostrophe
    texts = texts.replace('&#39;', "'")
    # replace '#' and '$' with the prefixes 'hashtag_' and 'cashtag_'
    texts = re.sub(r'(\#)(\S+)', r'hashtag_\2', texts)
    texts = re.sub(r'(\$)([A-Za-z]+)', r'cashtag_\2', texts)
    # replace '@' with the prefix 'mention_'
    texts = re.sub(r'(\@)(\S+)', r'mention_\2', texts)
    # convert emoji to their text names
    texts = emoji.demojize(texts, delimiters=("", " "))
    return texts.strip()
tokenizer_loaded = RobertaTokenizer.from_pretrained('zhayunduo/roberta-base-stocktwits-finetuned')
model_loaded = RobertaForSequenceClassification.from_pretrained('zhayunduo/roberta-base-stocktwits-finetuned')
nlp = pipeline("text-classification", model=model_loaded, tokenizer=tokenizer_loaded)
sentences = pd.Series(['just buy', 'just sell it',
                       'entity rocket to the sky!',
                       'go down',
                       'even though it is going up, I still think it will not keep this trend in the near future'])
# If the input contains URLs or '@', '#' or '$' symbols, apply process_text first for more accurate results:
# sentences = list(sentences.apply(process_text))
sentences = list(sentences)
results = nlp(sentences)
print(results)  # two labels: label 0 is bearish, label 1 is bullish
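To make the preprocessing and the label convention above concrete, here is a minimal sketch that runs without downloading the model. `replace_markers` is a hypothetical helper that repeats only the regex substitutions from `process_text` (it skips the `&#39;` replacement and the `emoji.demojize` step), and `to_sentiment` is a hypothetical helper that assumes the pipeline returns label strings ending in the class index, such as `LABEL_0` and `LABEL_1`. The sample post and the mock scores are invented for illustration and are not model output.

```python
import re


def replace_markers(text):
    """Hypothetical helper: the regex steps of process_text, without the emoji step."""
    text = re.sub(r'https?://\S+', "", text)                 # drop http(s) URLs
    text = re.sub(r'www.\S+', "", text)                      # drop www links
    text = re.sub(r'(\#)(\S+)', r'hashtag_\2', text)         # '#tag' becomes 'hashtag_tag'
    text = re.sub(r'(\$)([A-Za-z]+)', r'cashtag_\2', text)   # '$AAPL' becomes 'cashtag_AAPL'
    text = re.sub(r'(\@)(\S+)', r'mention_\2', text)         # '@user' becomes 'mention_user'
    return text.strip()


def to_sentiment(result):
    """Hypothetical helper: map one pipeline result dict to a readable name.

    Assumes the label string ends in the class index (0 = bearish, 1 = bullish).
    Unrecognized labels are returned unchanged.
    """
    names = {"0": "Bearish", "1": "Bullish"}
    return names.get(result["label"][-1], result["label"])


# Invented sample post, used only to show the substitutions.
sample = "@trader1 $AAPL looks strong #earnings https://example.com/chart"
print(replace_markers(sample))
# mention_trader1 cashtag_AAPL looks strong hashtag_earnings

# Mock results in the shape a text-classification pipeline returns; scores are made up.
mock_results = [
    {"label": "LABEL_1", "score": 0.98},
    {"label": "LABEL_0", "score": 0.91},
]
print([to_sentiment(r) for r in mock_results])
# ['Bullish', 'Bearish']
```

If the model's configuration defines different label names, `to_sentiment` returns them as they are, so check the printed `results` once before relying on the mapping.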