requirements on analysis tools for "Big Data"

especially text analysis tools, of course
jussi karlgren
stockholm nlp meetup, october 2014
text analysis on internet scale since 2008
applied primarily to media monitoring
right here in Stockholm
jussi karlgren
researcher in text stylistics and evaluation of information access systems
adjunct professor of language technology at kth
big data is not just slightly larger data
it’s not a scalability issue
situation awareness, not search
@tripbirds at instead. Ping @jonasl @jocke
Internetdagarna. #ind11 (@ Stockholm Waterfront Congre
Centre w/ 7 others)
@RonnieRitter ska på #ind11! Eller menar du den här jädra
omg, YITAS
så-gullig-marsplan? :)
and my rents caught me
@LennartBon @mansj gameon! #ind11
Gr8 work, rly gr8 work!
wos the otha
place we gt
chased out of?hehe!tht
so funny!
@per_p ja! Kul att du kommer!
där hela
the yr 5's Blir
enit idu
wos callin
ya but veckan?
u were 2 busy #in
*grins evil*
lol well.....dunno lol
i wud tap that
u lik?
thx stranger
thank u hehe :)ssssssshhhhhh! ..... i wish i wos eatin
all da cakes lol.
member. fantage_girl08 8 months ago. we hate each!
other bcuz ...
we r the sissters adn we wil cill u al bcuz u dun
ppl think you need to grow up x
beliv our story!!!!
por ke sabes tantO...
is bcuz she is dead she tried 2 come bak then and
soposedly got out ..... of 2 sexy 4 u? and all the
q's and a's are similar to each other. ...
Dragon Ball Z: Budokai Tenkaichi 4 might feature !a
character ...
-MORE MORE MORE ATTACKS, like goku for example, in
bt3 he he has two .... In yor msg
bjr, Adele!! koi29?
lut, ca va?
had a kewl day morph?
i came home 2 am ..and went up at 9
sum bastard woke me up @ 2pm
heh naah.
Irritation i h8 dat guy
phoner on the sms?
#"$%&%' ("#$%&. τήλε — !"#$ () *)+. video — %&'()%) — ,)-./$0)"*1
1)*#"), 42%$)-/ .)--+#+$/ )& 5#$)(/ 642-)&/0)423)61*#"#"/3 .2$
%/*7# 8$)&3*/$#"/3+-/3.!
8#*#-/7/2 4)19:#"*2"#"/ 3)+)-#3 /;#"3 0-/)&/ 1920-/)&/ 9*#"/()&.
*#-/7/) 06() )<81)*1$/ :-#*) 35#$2./, +14=) %/$/+)()( 3)&)6)2"
3)/&52$4)=/2 51&<=/) 4//;2. "2*2 9*#"./ /&8#$&#8/3 0)&-/+)$#"
81)*1$/ 0)6() /&8#$&#8 8#*#-/7/).!
『カイカイ・キキ(Kaikai Kiki)』を主宰し、若手アーティストのプロデュースを行うなど、活発な活動を展開
4 h of news feed
43171 English
15135 Chinese
9939 German
7526 Spanish
4899 French
3611 Russian
3167 Japanese
2879 Italian
2722 Korean
2432 Swedish
1904 Dutch
1795 Portuguese
1752 Turkish
1389 Arabic
1362 Hungarian
5 Kannada
4 Galician
3 Maltese
the internet is not in english any more
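the counts listed above can be turned into shares with a few lines of python. this is a minimal sketch: the numbers are copied from the list, the variable names are illustrative, and the shares are relative to the languages listed here only, not to the whole feed.

```python
# Item counts per language, copied from the list above (4 h of news feed).
counts = {
    "English": 43171, "Chinese": 15135, "German": 9939, "Spanish": 7526,
    "French": 4899, "Russian": 3611, "Japanese": 3167, "Italian": 2879,
    "Korean": 2722, "Swedish": 2432, "Dutch": 1904, "Portuguese": 1795,
    "Turkish": 1752, "Arabic": 1389, "Hungarian": 1362, "Kannada": 5,
    "Galician": 4, "Maltese": 3,
}

# Total over the listed languages only (the feed itself may contain more).
total = sum(counts.values())
shares = {lang: n / total for lang, n in counts.items()}

# Share of the listed items that are not in English.
non_english_share = 1.0 - shares["English"]

for lang, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<12}{n:>7}{shares[lang]:>8.1%}")
print(f"non-English share of listed items: {non_english_share:.1%}")
```

among the languages listed, english accounts for well under half of the items.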
noise is not something you wash off
change is normal
learning, not training
it's not search any more
it's not about the needle in the haystack
it's about the shape of the field over time
it's about situation awareness, not misplaced
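one way to read "the shape of the field over time" is to follow how much of the conversation a term takes up per time window, instead of retrieving the individual items that mention it. the sketch below assumes posts arrive as (unix timestamp, text) pairs and uses naive whitespace tokenisation; the function names, the window size and the sample posts are all hypothetical.

```python
from collections import Counter, defaultdict


def term_shares_by_window(posts, window_seconds=3600):
    """Relative frequency of every token, per time window.

    posts: iterable of (unix_timestamp, text) pairs (an assumed input format).
    Returns {window_index: {token: share_of_tokens_in_that_window}}.
    """
    windows = defaultdict(Counter)
    for timestamp, text in posts:
        # Naive tokenisation; real noisy text would need something sturdier.
        windows[timestamp // window_seconds].update(text.lower().split())
    shares = {}
    for window, counter in sorted(windows.items()):
        total = sum(counter.values())
        shares[window] = {tok: n / total for tok, n in counter.items()}
    return shares


def trajectory(shares, term):
    """The share of one term across the observed windows, in time order."""
    return [(window, s.get(term, 0.0)) for window, s in sorted(shares.items())]


if __name__ == "__main__":
    # Hypothetical sample posts, for illustration only.
    sample = [
        (0, "strike at the port"),
        (1200, "port strike continues"),
        (4000, "weather is nice"),
        (7300, "strike spreads to the airport"),
    ]
    for window, share in trajectory(term_shares_by_window(sample), "strike"):
        print(window, f"{share:.2f}")
```

the output is a trajectory rather than a hit list, which is the point: the question is how the term's share moves, not where a single mention is.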
from features to tasks - where do we (as nlp professionals)
want to contribute?
what is our expertise?
fiddling with parameters?
formulating features?
building tools?
understanding people?
understanding language?
note 1
any model we wish to deploy in practical use must
prove its worth by improving something
how to evaluate?
what do we want to achieve?
product output quality?
better coverage?
product agility?
explanatory power?
evaluation? how? what are our objective target
functions? make them explicit!
research: explanatory height!
engineering: scalability!
industrial application: convenience and scale!
sales: revenue!
evaluation by
research: gold standards
engineering: unit tests, performance profiling, and
sales: revenue
industrial application: profit
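for the research row, evaluation against a gold standard typically reduces to precision, recall and f1 over the items a system returns. a minimal sketch, assuming both the system output and the gold standard can be expressed as sets of item identifiers; the function name is illustrative.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical item identifiers, for illustration only.
    p, r, f = precision_recall_f1(predicted={1, 2, 3, 4}, gold={2, 3, 5})
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

this makes one target function explicit; it says nothing about the engineering, sales or industrial rows, which need their own measures.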
gold standard
test subjects
use case
how can we stop our resource from becoming
a wet blanket?
note 2
difficult vs easy tasks
why use computational methods and machinery for information access?
1 amount of data is overwhelming → reduce data complexity
let’s call these “simple” tasks
2 signal is weak and complex → peer closer into data
let’s call these “difficult” tasks
note 3
meteorology as a model
measure process, rather than outcome
a case in point
sentiment analysis
sentiment analysis is difficult and challenging
And the sound quality - my God!
Raymond left no room for error on his recordings and it shows.
Definitely one of the better tracks on the album.
Wow, could have been a expansion pack.
I loved The Spy Who Came In From The Cold but the movie is a bit dated in a
way the book never will be.
Meat is more environmentally friendly than seafood.
I am unsure about the feasibility of this knitting pattern.
I love the Samsung B2710 but I would not recommend it to my colleagues.
I don't know if I should call her up – I liked her when I met her last weekend.
This is true.
this is why it's fun
but is it any good?
(in terms of the above discussion)
since human emotion is (likely to be) better
represented by a dimensional model than by a
categorical one, textual attitude should also
be modelled dimensionally
can this be tested with our setup?
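the difference between the two kinds of model can be made concrete. in a categorical model an attitude is one label out of a fixed set; in a dimensional model it is a point in a space, so that closeness between attitudes is defined. the sketch below assumes a two-dimensional valence/arousal space; the coordinates are made up for illustration and are not taken from any study or from the talk.

```python
import math

# Hypothetical (valence, arousal) coordinates in [-1, 1]; illustrative only.
EMOTIONS = {
    "anger": (-0.6, 0.8),
    "fear": (-0.7, 0.7),
    "sadness": (-0.7, -0.5),
    "enjoyment": (0.8, 0.5),
}


def distance(a, b):
    """Euclidean distance between two named emotions in the space."""
    return math.dist(EMOTIONS[a], EMOTIONS[b])


def nearest_label(point):
    """Collapse a point back to a category: the closest named emotion."""
    return min(EMOTIONS, key=lambda name: math.dist(EMOTIONS[name], point))


if __name__ == "__main__":
    print(distance("anger", "fear"), distance("anger", "sadness"))
    print(nearest_label((0.7, 0.4)))
```

a categorical label can always be recovered from a point, as `nearest_label` does, but not the other way round; that asymmetry is what a test of the claim would have to exploit.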
well-established basic emotions
anger, fear, sadness, enjoyment, disgust, surprise, contempt (recent addition)
candidate basic emotions
amusement, relief, excitement, shame, pride in achievement, guilt, embarrassment, contentment, awe, sensory pleasure
example of criticism:
where are jealousy and paternal love?
technology enabler: the big data stack
a semantic base technology?
is this an example of that?
are these two the same?
has this changed? how?
what is the relation of this and that?
is this a new way of saying that?
are these or those more like this?
is this typical or strange?
can we trust this?
does the author believe this to be true?
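several of these questions ("are these two the same?", "are these or those more like this?") come down to comparing items in some shared representation. the slides do not name one, so the sketch below uses plain bag-of-words counts and cosine similarity as a stand-in; the function names are hypothetical and a real semantic base technology would use a richer representation.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Token counts with naive whitespace tokenisation (an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like(this, these, those):
    """Answer "are these or those more like this?" for three texts."""
    target = bag_of_words(this)
    sim_these = cosine(target, bag_of_words(these))
    sim_those = cosine(target, bag_of_words(those))
    return "these" if sim_these >= sim_those else "those"


if __name__ == "__main__":
    # Hypothetical texts, for illustration only.
    print(more_like("the port strike", "strike at the port", "nice weather today"))
```

word overlap of this kind cannot answer "is this a new way of saying that?", which is exactly where a semantic representation has to do better.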
evaluating marketing campaigns
giveaways to bloggers
traditional marketing campaign
giveaways to cosmetics subscribers
tracking violence in the world
kazakhstan stands out
unreported in western media, a violent altercation took place this week in 2012
in a court where protestors were sentenced to prison
what are now our requirements?
learn, don't expect teaching
answer the right questions
(first, formulate those questions)
embrace change, analogy, and homeosemy
model what is similar between languages, not what is specific to them
aim for situation awareness, not classification as primary task
adjust evaluation metrics accordingly
measure process, not outcome on gold standard
note that sales figures are but one aspect of evaluation